Introduction

This document describes how to use controlled terminological resources (sometimes resulting in a controlled vocabulary) and how to generate such resources from more semantically rich ontologies. It is not a comprehensive or definitive guide for all such efforts; it describes only the particular method used for this project.

What is a controlled vocabulary?

A controlled vocabulary (CV) is a semi-closed set of terms used for a particular purpose. Controlled vocabularies are semi-closed because they can gain or lose members, but only through a prescribed process. If there is no prescribed process for adding or removing members, the set is uncontrolled and is simply a list of terms. Additionally, controlled vocabularies (in virtue of their being controlled) must be managed by a source. In this case, the prescribed process as implemented in this repository is the mechanism that controls the vocabulary.

The purpose of a controlled vocabulary is to identify the set of terms that are acceptable for entry in some field, in order to promote consistent description and retrieval of data. Minimally, a controlled vocabulary contains the terms acceptable for a specific usage (as defined by the controlling mechanism or agents). It may also contain other information about those terms (e.g., definitions, descriptions, synonyms, usage recommendations, examples of usage, notes, references, relationships to other terms, metadata), but it need not contain anything other than a simple list of terms. For this reason, controlled vocabularies are computationally lightweight but semantically poor. CVs are primarily helpful for data entry: they are most useful when they constrain data values at a point early in the data ingest process.
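As an illustration of constraining values at data entry, the following minimal sketch checks one field of an incoming record against a vocabulary. The field name `sample_type`, the three terms, and the function names are hypothetical and are not taken from this project.

```python
# Hypothetical vocabulary for a hypothetical "sample_type" field.
SAMPLE_TYPE_CV = frozenset({"blood", "saliva", "tissue"})


def validate_value(value: str, vocabulary: frozenset) -> bool:
    """Return True only if the value is an exact member of the vocabulary."""
    return value in vocabulary


def check_record(record: dict, field: str, vocabulary: frozenset) -> list:
    """Return a list of error messages for one record (empty if the field is valid)."""
    value = record.get(field)
    if value is None:
        return [f"missing field: {field}"]
    if not validate_value(value, vocabulary):
        return [f"{field}: {value!r} is not in the controlled vocabulary"]
    return []


if __name__ == "__main__":
    print(check_record({"sample_type": "blood"}, "sample_type", SAMPLE_TYPE_CV))
    print(check_record({"sample_type": "Blood"}, "sample_type", SAMPLE_TYPE_CV))
```

The match is deliberately exact (case-sensitive, no trimming) in this sketch; whether to normalize values before checking is a design choice for the controlling process, not something a vocabulary settles on its own.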

What is an ontology?

An ontology is a representational artifact whose representations are intended to designate some entities (universals, defined classes, relations). An ontology comprises a taxonomy as a proper part, but has the additional semantic richness of relations and axiomatizations (in a computable formal language).

Each entity (universal, defined class, relation) in an ontology has a label as well as an identifier. The identifier is (typically) a persistent URL or IRI. The label is a term in a natural language that people use to designate the entity. Each entity in an ontology (typically) has a definition, description, usage recommendation, notes, references, and at least one specified relationship to other entities (viz., is_a relation, but often others as well).
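To make the shape of an ontology entity concrete, here is a rough sketch of one as a Python record. The class name, field names, IRIs, labels, and definitions are all illustrative assumptions; a real ontology would be stored in a formal language rather than in Python objects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntity:
    """One ontology entity: a persistent identifier plus human-readable annotations."""

    iri: str                     # persistent identifier (hypothetical IRIs below)
    label: str                   # natural-language term people use for the entity
    definition: str = ""
    is_a: tuple = ()             # IRIs of parent entities


ENTITIES = [
    OntologyEntity(
        iri="http://example.org/onto/0000001",
        label="specimen",
        definition="A portion of material collected for study.",
    ),
    OntologyEntity(
        iri="http://example.org/onto/0000002",
        label="blood specimen",
        definition="A specimen that consists of blood.",
        is_a=("http://example.org/onto/0000001",),
    ),
]

# Index by identifier rather than by label, since the identifier is the persistent key.
BY_IRI = {entity.iri: entity for entity in ENTITIES}


def parent_labels(entity: OntologyEntity) -> list:
    """Resolve the is_a parents of an entity to their labels."""
    return [BY_IRI[parent].label for parent in entity.is_a]
```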

What is the connection between a controlled vocabulary and an ontology?

There is no necessary relationship between controlled vocabularies and ontologies, although there can be a very close relationship between them. If one is using ontologies to validate data and data structures, then generating a CV from an ontology makes the controlled vocabulary a good first check that all ingested data conforms to the acceptable schema(s) for data integration. Moreover, allowing any CV values/terms to deviate from the ontology entity labels will introduce potential problems with data integration further along in the process. One can think of generating a CV from an ontology as an initial verification step for acceptable data.

Controlled vocabularies stand in contrast to an ontology, which contains a set of classes and relations (and sometimes individuals) meant to be representative of a portion of reality. Of course, those classes and relations (and individuals) have names or labels, so it should be (in principle) possible to generate a controlled vocabulary by extracting the names and labels of classes and relations of an ontology and calling that a “controlled vocabulary.” One of the benefits of this approach is that it is also possible, in principle, to extract other information about those classes/relations/individuals in an ontology and use that information in a controlled vocabulary as well. Since most good ontologies provide definitions of classes/relations, it should be possible to add definitions to terms in a controlled vocabulary by extracting both the label and the definition for every class/relation in an ontology (or subset thereof) and converting that into a controlled vocabulary.
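The extraction described above can be sketched with the Python standard library. This sketch assumes the ontology is serialized as OWL in RDF/XML, that labels use `rdfs:label`, and that definitions use `skos:definition`; none of these is stated for this project, and many ontologies use a different definition property. The embedded ontology fragment and its IRIs are invented for the example.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
    "skos": "http://www.w3.org/2004/02/skos/core#",
}

# Hypothetical two-class ontology fragment; the second class has no definition.
ONTOLOGY_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <owl:Class rdf:about="http://example.org/onto/0000001">
    <rdfs:label>specimen</rdfs:label>
    <skos:definition>A portion of material collected for study.</skos:definition>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/onto/0000002">
    <rdfs:label>blood specimen</rdfs:label>
  </owl:Class>
</rdf:RDF>
"""


def extract_terms(xml_text: str) -> list:
    """Collect identifier, label, and definition for every labeled top-level owl:Class."""
    root = ET.fromstring(xml_text)
    about_attr = f"{{{NS['rdf']}}}about"
    terms = []
    for cls in root.findall("owl:Class", NS):
        label = cls.findtext("rdfs:label", default=None, namespaces=NS)
        if label is None:
            continue  # a class without a label cannot contribute a term
        definition = cls.findtext("skos:definition", default="", namespaces=NS)
        terms.append(
            {
                "iri": cls.get(about_attr),
                "term": label.strip(),
                "definition": definition.strip(),
            }
        )
    return terms


if __name__ == "__main__":
    for entry in extract_terms(ONTOLOGY_XML):
        print(entry)
```

For anything beyond a simple, flat RDF/XML file, a dedicated RDF or OWL library would be a more robust choice than raw XML parsing.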

There is no standard format for a controlled vocabulary, but there are standards for ontologies. Because an ontology that follows those standards records its labels and definitions in predictable places, the move from ontology to controlled vocabulary is possible without much extra work (provided the ontology developer has followed the norms and rules of ontology development). The reverse move (generating an ontology from a controlled vocabulary) is much more difficult, if it is possible at all.

A class in an ontology will (most likely) carry more information than a term in a controlled vocabulary needs. For this reason, anyone seeking to generate a controlled vocabulary from an ontology should carefully consider what information to extract from the ontology and what information to ignore. For example, many (most, if not all) classes in an ontology will be annotated with formal axioms that allow automated reasoning over the classes/individuals in the ontology. Typically, these axioms are of no use to a user of a controlled vocabulary, so it is generally advisable to leave them out of the controlled vocabulary even though they are very important in the ontology. Minimally, it is recommended that a controlled vocabulary contain the label (term), the definition, and any relevant synonyms. These are the most helpful informational resources for the user.
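One way to picture this selection is as a projection from a full ontology class record down to the three recommended fields. The record layout, the key names, and the example axiom string below are hypothetical; the point is only that the axioms and identifiers stay behind in the ontology.

```python
def to_cv_entry(ontology_class: dict) -> dict:
    """Keep only the label (as the term), the definition, and any synonyms."""
    return {
        "term": ontology_class["label"],
        "definition": ontology_class.get("definition", ""),
        "synonyms": list(ontology_class.get("synonyms", [])),
    }


# Hypothetical ontology class record, including information a CV user does not need.
BLOOD_SPECIMEN = {
    "iri": "http://example.org/onto/0000002",
    "label": "blood specimen",
    "definition": "A specimen that consists of blood.",
    "synonyms": ["blood sample"],
    "axioms": ["SubClassOf(blood specimen, specimen)"],
}

if __name__ == "__main__":
    print(to_cv_entry(BLOOD_SPECIMEN))
```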

How do we generate a controlled vocabulary from an ontology?

Currently, we maintain a controlled vocabulary page for data submitters that presents the controlled vocabularies generated from our ontologies. The page contains resources for data submitters, including links to tables of controlled vocabularies with accompanying semantic information for each entry (definitions, citations, and alternative terms/synonyms, where applicable).

These tables are generated from our ontologies and so our controlled vocabularies (if we think of them as the tables linked from the above page) are generated directly from our ontologies. In this way, as explained above, we ensure that there is an initial data validation step before data enters our system.
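As a rough sketch of the table-generation step, the function below renders a list of vocabulary entries as CSV text. The CSV format, the column names, and the example entries are assumptions made for illustration; the actual tables and the tooling that produces them may differ.

```python
import csv
import io


def render_cv_table(entries: list) -> str:
    """Render vocabulary entries as CSV text, sorted by term for stable output."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["term", "definition", "synonyms"],
        lineterminator="\n",
    )
    writer.writeheader()
    for entry in sorted(entries, key=lambda e: e["term"].lower()):
        writer.writerow(
            {
                "term": entry["term"],
                "definition": entry.get("definition", ""),
                # Multiple synonyms share one cell, separated by semicolons.
                "synonyms": "; ".join(entry.get("synonyms", [])),
            }
        )
    return buffer.getvalue()


# Hypothetical entries, as they might look after extraction from an ontology.
EXAMPLE_ENTRIES = [
    {"term": "tissue", "definition": "A portion of tissue.", "synonyms": []},
    {"term": "blood", "definition": "A portion of blood.", "synonyms": ["whole blood"]},
]

if __name__ == "__main__":
    print(render_cv_table(EXAMPLE_ENTRIES), end="")
```

Regenerating the table from the ontology on every change, rather than editing the table by hand, keeps the terms from drifting away from the ontology labels.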