Introduction to MeSH in XML format.
The National Library of Medicine has adopted the Extensible Markup Language (XML) as a standard format for its data files. The MeSH (Medical Subject Headings) vocabulary file is available in an XML format that is similar to the format and DTD developed for MEDLINE. (See MEDLINE®/PubMed® Data.) This format will be of particular interest to those who previously received MeSH data in the NLM ELHILL format and of interest to PubMed and UMLS® developers. ASCII MeSH users and new users of MeSH data may also wish to consider use of vocabulary data in the XML format.
Some data are new in XML MeSH, particularly those elements* pertaining to concepts. While this adds to the number of elements, the concept element provides a powerful way of representing term synonymy as well as other information, such as relations between concepts. XML MeSH also reduces the number of elements relative to previous formats. For example, the former elements MH, NM, SY, BX, SH, and QX are all different kinds of terms and so are all represented by the term and string elements in XML MeSH. A list of XML data elements is available. A conversion table is also available which lists ASCII MeSH and ELHILL MeSH elements with the corresponding element in XML MeSH.
* Note that this document often uses 'element' as a generic term for database field content. In XML, 'element' is a technical term denoting the primary components delimited by beginning and end tags. See the next numbered item, below.
2. Tagged elements: "human-legible and reasonably clear."
Instead of short mnemonics, such as 'DA' and 'EV', XML MeSH uses XML beginning and end tags, for example, <DateCreated> and <EntryVersion>. These data markers unambiguously indicate the beginning and ending of each data element instead of relying on invisible end-of-line characters. This allows data to wrap to the next line within an element, making it easier for a human (vs. a computer) to read an XML document. The possibility of wrapping data also allows tags to be more descriptive, since there is no longer a great need for minimizing the length of tags. This contributes to the goal of the official XML specification that "XML documents should be human-legible and reasonably clear." 1 One cost of this advantage is that data files are much larger than in the past but, as the XML specification also says, "Terseness in XML markup is of minimal importance." 2
3. Concepts, synonyms, and Descriptor structure
3.1 Concepts locate synonymy.
Some data elements are new in XML MeSH, independently of the new structure, primarily those elements pertaining to concepts. The concept-centric nature of MeSH is described elsewhere. 3 A concept is the common meaning shared by synonymous terms. MeSH and other vocabularies have long used concepts implicitly. With the new MeSH maintenance system introduced with 2000 MeSH, a concept can now be represented simply and precisely by a concept Unique Identifier (<ConceptUI>). Synonymous terms are those terms which share the same <ConceptUI>.
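As a minimal sketch of the idea that synonymous terms are those sharing a <ConceptUI>, the following Python groups term strings by concept. The fragment reuses element names and values from the example in section 4; the function name synonyms_by_concept is hypothetical, and the standard-library ElementTree parser is an assumption about tooling, not something NLM prescribes.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Fragment built from the element names and values used in this document.
SAMPLE = """
<ConceptList>
  <Concept PreferredConceptYN="Y">
    <ConceptUI>M0000005</ConceptUI>
    <ConceptName><String>Abdomen</String></ConceptName>
    <TermList>
      <Term><TermUI>T000012</TermUI><String>Abdomen</String></Term>
      <Term IsPermutedTermYN="Y"><TermUI>T000012</TermUI><String>Abdomens</String></Term>
    </TermList>
  </Concept>
</ConceptList>
"""


def synonyms_by_concept(xml_text):
    """Map each ConceptUI to the term strings listed under that concept."""
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)
    for concept in root.iter("Concept"):
        concept_ui = concept.findtext("ConceptUI")
        for term in concept.findall("TermList/Term"):
            groups[concept_ui].append(term.findtext("String"))
    return dict(groups)


synonym_groups = synonyms_by_concept(SAMPLE)
```

Every string collected under one key belongs to the same concept, so the dictionary values can be read as synonym sets.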
3.2 Descriptors as a class of Concepts
A Descriptor is often broader than a single concept and so may consist of a class of concepts. Concepts, in turn, correspond to a class of terms which are synonymous with each other. Thus MeSH has a three-level structure:
Descriptor
  Concept
    Term
XML format, with its hierarchical sub-element structure, lends itself to representing these levels. See example below.
3.3 UIs are persistent names for Concepts and other objects
We normally refer to each of these objects by a specific term which names the object, e.g., 'Heart', but since this name can be changed, an unchanging numeric code (UI) is assigned to each Descriptor, Concept, and most Terms.* We take advantage of this persistence, using the UI in referring to an object in another record. For example, the "See Related" reference in a record tells the user to consider another Descriptor record. Because the UI is the persistent name of the Descriptor, the SeeRelated element includes the UI; the Descriptor name is included as well.
* Permuted Terms (terms automatically generated from manually entered terms) are on the same level as manually created terms but are importantly different in that the associated <TermUI> element does not identify the term but rather refers to the term from which the Permuted Term was generated.
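Because a Permuted Term carries the <TermUI> of the term it was generated from, a consumer can use that shared UI to attach each permuted string to its source term. The sketch below assumes that a <Term> without IsPermutedTermYN="Y" is a manually entered term; the function name group_permuted is hypothetical, and the data come from the example in section 4.

```python
import xml.etree.ElementTree as ET

TERM_LIST = """
<TermList>
  <Term PrintFlagYN="Y">
    <TermUI>T000012</TermUI>
    <String>Abdomen</String>
  </Term>
  <Term IsPermutedTermYN="Y" LexicalTag="NON">
    <TermUI>T000012</TermUI>
    <String>Abdomens</String>
  </Term>
</TermList>
"""


def group_permuted(term_list_xml):
    """Group term strings by TermUI, separating the source term from permutations."""
    root = ET.fromstring(term_list_xml)
    groups = {}
    for term in root.findall("Term"):
        entry = groups.setdefault(
            term.findtext("TermUI"), {"source": None, "permuted": []}
        )
        # Assumption: a missing IsPermutedTermYN attribute means "not permuted".
        if term.get("IsPermutedTermYN") == "Y":
            entry["permuted"].append(term.findtext("String"))
        else:
            entry["source"] = term.findtext("String")
    return groups


permuted_groups = group_permuted(TERM_LIST)
```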
3.4 Data elements attach to the appropriate object
The Descriptor/Concept/Term structure also makes it possible to attach various data elements in MeSH to the appropriate object. For example, the Scope Note belongs to the concept rather than the Descriptor - a Descriptor may have several different concepts and so several different scope notes. Similarly, thesauri have long distinguished between "broader terms" and "narrower" terms, but it is clear that these are relations between concepts and only derivatively between terms of the respective concepts. The Unified Medical Language System (UMLS) Metathesaurus® has a similar structure, and this has had a significant influence on the design of XML MeSH.
4. An example
One of the most noticeable differences between previous MeSH data structures and XML MeSH is the several levels of XML elements. (The indented sub-element structure in any XML file can be viewed by using an XML browser, such as Internet Explorer 5.x.) This hierarchical structure is inherent in XML but lends itself to the concept-oriented structure in MeSH and replaces the sub-element structure. Consider, for example, this fragment of an XML Descriptor record:
<DescriptorRecord ...><!-- Descriptor -->
  <DescriptorUI>D000005</DescriptorUI>
  <DescriptorName><String>Abdomen</String></DescriptorName>
  <Annotation> region &amp; abdominal organs... </Annotation>
  <ConceptList>
    <Concept PreferredConceptYN="Y"><!-- Concept -->
      <ConceptUI>M0000005</ConceptUI>
      <ConceptName><String>Abdomen</String></ConceptName>
      <ScopeNote> That portion of the body that lies between the thorax and the pelvis.</ScopeNote>
      <TermList>
        <Term ... PrintFlagYN="Y" ... ><!-- Term -->
          <TermUI>T000012</TermUI>
          <String>Abdomen</String><!-- String = the term itself -->
          <DateCreated>
            <Year>1999</Year>
            <Month>01</Month>
            <Day>01</Day>
          </DateCreated>
        </Term>
        <Term IsPermutedTermYN="Y" LexicalTag="NON">
          <TermUI>T000012</TermUI>
          <String>Abdomens</String>
        </Term>
      </TermList>
    </Concept>
  </ConceptList>
</DescriptorRecord>
The corresponding data in ELHILL format are:
UI - D000005
MH - Abdomen
AN - region & abdominal organs ...
BX - Abdomens:0:00000000:0000000:@@@@@@
MS - That portion of the body that lies between the thorax and the pelvis.
The XML example illustrates the following features.
4.1 Descriptor structure
The descending order of the Descriptor/Concept/Term objects corresponds to the Descriptor structure.
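The descending Descriptor/Concept/Term order can be walked with nested loops. The following is a sketch only: it uses Python's standard ElementTree, omits the attributes and elements that are elided ("...") in the example above, and the function name flatten_record is hypothetical.

```python
import xml.etree.ElementTree as ET

# The example record with elided ("...") parts left out so that it parses.
RECORD = """
<DescriptorRecord>
  <DescriptorUI>D000005</DescriptorUI>
  <DescriptorName><String>Abdomen</String></DescriptorName>
  <ConceptList>
    <Concept PreferredConceptYN="Y">
      <ConceptUI>M0000005</ConceptUI>
      <ConceptName><String>Abdomen</String></ConceptName>
      <TermList>
        <Term PrintFlagYN="Y">
          <TermUI>T000012</TermUI>
          <String>Abdomen</String>
        </Term>
        <Term IsPermutedTermYN="Y" LexicalTag="NON">
          <TermUI>T000012</TermUI>
          <String>Abdomens</String>
        </Term>
      </TermList>
    </Concept>
  </ConceptList>
</DescriptorRecord>
"""


def flatten_record(record_xml):
    """Return one (DescriptorUI, ConceptUI, TermUI, string) row per term."""
    record = ET.fromstring(record_xml)
    descriptor_ui = record.findtext("DescriptorUI")
    rows = []
    for concept in record.findall("ConceptList/Concept"):
        concept_ui = concept.findtext("ConceptUI")
        for term in concept.findall("TermList/Term"):
            rows.append(
                (descriptor_ui, concept_ui,
                 term.findtext("TermUI"), term.findtext("String"))
            )
    return rows


rows = flatten_record(RECORD)
```

Each row repeats the identifiers of the enclosing levels, which is one way to carry the three-level structure into a flat table.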
4.1.1 The <String> element
In addition to the <Term> element there is also a <String> element. Why both? Why not make the term itself the content of the Term element, e.g., <Term>Abdomen</Term>?
One reason is a technical XML reason. The element <Term> has sub-elements, and including both character data and sub-elements in the same element is called "mixed content" and is generally considered a "poor design practice" 4 in XML. Another reason, specific to MeSH, is that the <String> element is useful in the definition of the heavily used element <DescriptorName>. Using <String> in the definition is not strictly necessary (#PCDATA could have been used), but using the same sub-element for <Term> as for several other elements (<DescriptorName>, <ConceptName>, <QualifierName>, <SupplementalRecordName>) indicates that these elements have the same content, which is in fact the case.
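To illustrate the mixed-content point, the sketch below parses a hypothetical mixed-content <Term> (a layout XML MeSH does not use) alongside the layout with a separate <String> element. It assumes Python's standard ElementTree; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content layout: character data and a sub-element side by side.
MIXED = """<Term>
  Abdomen
  <TermUI>T000012</TermUI>
</Term>"""

# Layout used by XML MeSH: the term string has its own element.
SEPARATE = """<Term>
  <TermUI>T000012</TermUI>
  <String>Abdomen</String>
</Term>"""

# With mixed content the string arrives together with the layout whitespace
# around it, so the consumer has to clean it up.
mixed_text = ET.fromstring(MIXED).text

# With a dedicated element the string is retrieved exactly as stored.
separate_text = ET.fromstring(SEPARATE).findtext("String")
```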
Note that the <String> and <Term> elements are not exactly the same as the similarly named elements in the UMLS Metathesaurus. In XML MeSH there is one <Term> element for each term-string in the database, while in the UMLS Metathesaurus there can be multiple strings corresponding to a given term. Thus, the XML MeSH <Term> element is more like the UMLS string element. There is no XML MeSH element that directly corresponds to the UMLS Term data element. The MeSH <String> element is similar to the UMLS String but note that the element used in XML MeSH is used for reasons specific to XML, as noted above, not because it is a narrower type of object than the <Term> element.
As a general rule in the MeSH Descriptor structure, each child element inherits the properties of its parent and higher objects. This rule is nicely represented by the XML element hierarchy but there is nothing in the XML specification that requires inheritance. (In the language of computer science, the hierarchy is a "directed graph", which could just as well represent a flow-chart or maze diagram.)
4.2 Data Elements
Data elements are attached to the appropriate object. For example, the <Annotation> element is a Descriptor property, so it is a sub-element of the <DescriptorRecord> element. The Scope Note belongs with the concept, and so it is a sub-element of the <Concept> element.
4.3 Repeating elements represented by list elements
Sub-elements are created not only by the MeSH Descriptor structure, but also by the use of "List" elements, for example, <TermList>. This is NLM's practice for handling multiply-occurring data elements. While additional levels are introduced, it has the advantage that every element at a given level is unique, which makes it simpler for both computer parsers and human readers to navigate the hierarchy. For example, in processing Descriptor sub-elements, you can be sure you have all the subordinate Concept elements once you have located the <ConceptList> tag.
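The practical effect of the "List" convention is that a consumer looks up one container and then takes all of its children. A minimal sketch, assuming Python's standard ElementTree and a hypothetical helper named list_items:

```python
import xml.etree.ElementTree as ET

CONCEPT = """
<Concept PreferredConceptYN="Y">
  <ConceptUI>M0000005</ConceptUI>
  <TermList>
    <Term PrintFlagYN="Y"><TermUI>T000012</TermUI><String>Abdomen</String></Term>
    <Term IsPermutedTermYN="Y"><TermUI>T000012</TermUI><String>Abdomens</String></Term>
  </TermList>
</Concept>
"""


def list_items(parent, list_tag):
    """Return the children of the single list element, or [] if it is absent."""
    container = parent.find(list_tag)
    return [] if container is None else list(container)


concept = ET.fromstring(CONCEPT)
terms = list_items(concept, "TermList")
term_strings = [term.findtext("String") for term in terms]
```

The same helper would apply to <ConceptList> under a <DescriptorRecord>, since the convention is the same at each level.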
4.4 XML attributes
There are properties which are not XML elements but appear within an element's start tag, for example, <Term ... PrintFlagYN="Y" ... >. These are XML "attributes" and apply to the element with which they appear. These could have been elements instead, but one advantage of attributes in these cases is that we can specify all possible values. This provides the user with additional information, so the XML attribute representation was adopted where all possible values could be reasonably specified.
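Reading attributes, and checking them against a closed set of values, might look like the sketch below. The assumption that attributes whose names end in "YN" take only "Y" or "N" is inferred from the names in the example, not taken from the DTD; check_yn_attributes is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

TERM = (
    '<Term IsPermutedTermYN="Y" LexicalTag="NON">'
    "<TermUI>T000012</TermUI><String>Abdomens</String>"
    "</Term>"
)

# Assumption for this sketch: "YN" attributes are limited to these two values.
ALLOWED_YN = {"Y", "N"}


def check_yn_attributes(element):
    """Return the names of YN attributes whose value is neither Y nor N."""
    return [
        name
        for name, value in element.attrib.items()
        if name.endswith("YN") and value not in ALLOWED_YN
    ]


term = ET.fromstring(TERM)
is_permuted = term.get("IsPermutedTermYN") == "Y"
lexical_tag = term.get("LexicalTag")
invalid = check_yn_attributes(term)
```

A validating parser working from the DTD would perform this kind of check automatically; the helper only mimics it for a parser that does not validate.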
4.5 Reference to other records
As in the past, several MeSH data elements refer to other records, for example, the 'See Related' and 'Heading Mapped-To'. Since the UI (Unique Identifier) is the name of the record which never changes, these references employ the UI. In addition, since the familiar name of the record is the current preferred term for the preferred concept in the record, the name is included as well. For example in the Descriptor for 'Abnormalities, Drug-Induced' there is a "See Related" reference to 'Teratogens'. In XML this is represented by the following.
<SeeRelatedDescriptor>
  <DescriptorReferredTo>
    <DescriptorUI>D013723</DescriptorUI>
    <DescriptorName>
      <String>Teratogens</String>
    </DescriptorName>
  </DescriptorReferredTo>
</SeeRelatedDescriptor>
The element <SeeRelatedDescriptor> is needed in order to group each pair of <DescriptorUI> and <DescriptorName> elements. The element <DescriptorReferredTo> is not strictly necessary for grouping but is used in elements which refer to both a Descriptor and Qualifier to separate the two references, for example:
<HeadingMappedTo>
  <DescriptorReferredTo>
    <DescriptorUI>D000117</DescriptorUI>
    <DescriptorName>
      <String>Acetylglucosamine</String>
    </DescriptorName>
  </DescriptorReferredTo>
  <QualifierReferredTo>
    <QualifierUI>*Q000031</QualifierUI>
    <QualifierName>
      <String>analogs &amp; derivatives</String>
    </QualifierName>
  </QualifierReferredTo>
</HeadingMappedTo>
An XML processor does not need <DescriptorReferredTo> to distinguish <DescriptorUI> from <QualifierUI>, even if they were not contiguous with the <DescriptorName> and <QualifierName>. Nevertheless, the division clearly distinguishes the Descriptor from the Qualifier portion. This applies not only to the <HeadingMappedTo> element but also to the <EntryCombination> since reference is also to both a Descriptor and Qualifier. The same rationale for <DescriptorReferredTo> does not apply to elements which refer to just Descriptors, such as the <SeeRelatedDescriptor>, and <PharmacologicalAction> elements, but the element is included for them as well for the sake of consistency.
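Because every reference uses the same "ReferredTo" wrapper, one routine can extract the UI and name from both the Descriptor and the Qualifier part. This is a sketch under the assumption that the wrapper's tag prefix (Descriptor, Qualifier) also prefixes its UI and Name sub-elements, as in the examples above; the function name references is hypothetical, and the UI values are copied verbatim from the example, including the leading asterisk.

```python
import xml.etree.ElementTree as ET

HEADING_MAPPED_TO = """
<HeadingMappedTo>
  <DescriptorReferredTo>
    <DescriptorUI>D000117</DescriptorUI>
    <DescriptorName><String>Acetylglucosamine</String></DescriptorName>
  </DescriptorReferredTo>
  <QualifierReferredTo>
    <QualifierUI>*Q000031</QualifierUI>
    <QualifierName><String>analogs &amp; derivatives</String></QualifierName>
  </QualifierReferredTo>
</HeadingMappedTo>
"""

SUFFIX = "ReferredTo"


def references(element):
    """Return (UI, name) pairs for each ...ReferredTo child of the element."""
    pairs = []
    for child in element:
        if child.tag.endswith(SUFFIX):
            kind = child.tag[: -len(SUFFIX)]  # "Descriptor" or "Qualifier"
            ui = child.findtext(kind + "UI")
            name = child.findtext(kind + "Name/String")
            pairs.append((ui, name))
    return pairs


mapped = references(ET.fromstring(HEADING_MAPPED_TO))
```

The UI is the stable key for looking up the referred-to record; the name is carried along only for readability.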
In the XML specification the attribute type IDREF (along with the type ID) provides a similar function of referring to another unique identifier elsewhere in the database. 5 XML MeSH does not use this mechanism, primarily so that the data remain similar to previous formats.
5. Unification of elements across record types
While XML MeSH includes more data elements than previous formats, the XML MeSH structure actually eliminates some elements, or unifies them in common elements. For example, the former elements MH, NM, SY, BX, SH, and QX are all different kinds of terms and so are all represented by the <Term> and <String> elements in XML MeSH.
6. Special characters
6.1 XML characters
Some MeSH data contain the ampersand ('&') and angle brackets ('>' and '<'), which are data in MeSH but which XML processors treat as special symbols rather than as data. Therefore these symbols are represented by XML character entities:
|ampersand||&||&amp;|
|left angle bracket||<||&lt;|
|right angle bracket||>||&gt;|
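The substitution described in this section is normally handled by XML tooling rather than by hand. As a sketch, assuming Python's standard library: xml.sax.saxutils.escape produces the entity form when writing, and the parser restores the original characters when reading.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Annotation text as it exists in MeSH, before XML encoding.
raw_annotation = "region & abdominal organs"

# escape() replaces &, < and > with their entity forms.
escaped = escape(raw_annotation)

# Parsing the escaped form returns the original characters.
element = ET.fromstring("<Annotation>" + escaped + "</Annotation>")
round_tripped = element.text
```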
6.2 Non-ASCII characters
Data in XML MeSH files are encoded in the Unicode character set, specifically UTF-8. However, most of the data are in 7-bit ASCII format, a subset of UTF-8. A relatively small number of terms and Annotations contain one or more diacritical characters, such as the acute e (é). These are coded in UTF-8 format and will be correctly displayed by UTF-8 applications. Otherwise they may appear differently in different displays. Codings for diacritics in NLM data can be found in the table MEDLINE Character Database.
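The relationship between ASCII and UTF-8, and the display problem when a file is read with the wrong encoding, can be seen in a few lines of Python. This is an illustration only; Latin-1 is chosen here as one example of a mismatched decoder.

```python
# Pure ASCII text has identical bytes in ASCII and UTF-8.
ascii_bytes = "Abdomen".encode("ascii")
utf8_bytes = "Abdomen".encode("utf-8")

# The acute e occupies two bytes in UTF-8.
e_acute_bytes = "\u00e9".encode("utf-8")

# Decoding those bytes as UTF-8 restores the character; decoding them with
# a different encoding (Latin-1 here) yields two unrelated characters.
decoded_correctly = e_acute_bytes.decode("utf-8")
decoded_as_latin1 = e_acute_bytes.decode("latin-1")
```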
1 Item 1.1.6 in version 1.0 of XML. See DuCharme B. XML: The Annotated Specification. New Jersey: Prentice-Hall, 1999, p. 52. Specification is also at: http://www.w3.org/TR/REC-xml. Viewed Oct. 5, 2004.
2 Item 1.1.10. Ibid.
3 For further discussion of MeSH as concept-centered, see Nelson SJ et al., "Relationships in Medical Subject Headings" in: Bean CA; Green, R. Relationships in the organization of knowledge. New York: Kluwer Academic Publishers, 2001. See also Johnston WD et al., "Redefining a Thesaurus: Term-Centric No More." Poster presentation at: AMIA 1998 Annual Symp.; 1998 Nov 10; Orlando FL.
4 Dick K. XML: A Manager's Guide. Reading, Mass: Addison-Wesley, 1999, p. 29.
5 See Item 3.3.1 (Attribute Types) in the XML specification in note 1.