NLM® Uses XML for MEDLINE® Data
When the National Library of Medicine (NLM) set out to modernize its computer systems in the late 1990s and early 2000s, it chose XML (eXtensible Markup Language) as the new tagged format for disseminating its MEDLINE bibliographic citation data. A DTD (Document Type Definition) defines the structure of this XML. XML is the only distribution format for MEDLINE data created from the 2001 production year onward. The suite of DTDs used for MEDLINE/PubMed data is available at http://www.nlm.nih.gov/databases/dtd/.
This decision strengthens NLM's commitment to distribute its journal citation data in a format that is widely described and, therefore, familiar to many in the information industry, especially in the Internet Web environment. XML lends itself to human readability as well as to easy machine manipulation. In its transition to XML, NLM took the opportunity to examine the MEDLINE unit record and make organizational data changes such as moving information that enhances the article title (e.g., errata and retraction information) into separate elements and providing for new elements such as a corporate author name. Changes have been made to the DTDs as needed, usually on an annual basis.
Choosing XML as the data format was a natural extension of NLM's use of XML to receive bibliographic data electronically from publishers. XML also supports Unicode, a universal character set that permits NLM to continue to use selected diacritical marks, an important consideration given the worldwide nature of MEDLINE data. NLM uses the UTF-8 encoding form of Unicode.
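To illustrate why UTF-8 matters for diacritical marks, the following is a minimal sketch in Python, not NLM's own tooling. It serializes a title containing accented characters as UTF-8 encoded XML and reads it back unchanged. The sample title is invented for illustration, and the element name ArticleTitle is used only as a plausible label.

```python
import xml.etree.ElementTree as ET


def encode_title(title: str) -> bytes:
    """Serialize a title as a UTF-8 encoded XML fragment."""
    element = ET.Element("ArticleTitle")  # element name chosen for illustration
    element.text = title
    # encoding="utf-8" makes tostring() return bytes rather than str
    return ET.tostring(element, encoding="utf-8")


def decode_title(xml_bytes: bytes) -> str:
    """Parse the fragment; XML parsers assume UTF-8 when no encoding is declared."""
    return ET.fromstring(xml_bytes).text


title = "Études sur la méningite"  # invented sample text with diacritics
xml_bytes = encode_title(title)
round_tripped = decode_title(xml_bytes)
```

In UTF-8 each accented letter here occupies two bytes, while plain ASCII letters still occupy one, so existing ASCII-only content is unaffected by the encoding.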
In addition to MEDLINE, NLM produces XML versions of the following:
1. its controlled vocabulary, Medical Subject Headings® (MeSH®), a companion information resource for MEDLINE,
2. its monographic and serial data, which are also available in MARC21 (USMARC), and
3. its toxicological and environmental health data.
Links to those DTDs are also available at http://www.nlm.nih.gov/databases/dtd/.
NLM also uses XML in other products and services. One example is the behind-the-scenes programming for the NLM Gateway, a search tool that, in response to a single user request, queries several back-end retrieval systems operating at NLM. The current MEDLINE Data Creation and Maintenance System (DCMS) is another example.
The legacy MEDLINE database was ported from a proprietary database on an IBM mainframe to a commercially available relational database product on UNIX. The application used to create and maintain MEDLINE citations is web-based and generates HTML dynamically from the database. NLM uses XML files governed by a DTD for both input to and output from the MEDLINE database. To generate XML from the database, NLM currently uses a C program, written in-house, that produces DTD-compliant XML. Input to the database is also handled by a C program, which uses an XML parser for C to parse the input XML file.
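The input and output flow described above can be sketched roughly as follows. This is an illustrative Python example, not NLM's C programs. The sample record and its values are invented, the element names are a simplified subset modeled on the MEDLINE DTD, and the helper names parse_citation and to_xml are hypothetical. Note that Python's ElementTree checks well-formedness only and does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Invented sample record; element names are a simplified, assumed subset.
SAMPLE_XML = """<MedlineCitation>
  <PMID>12345678</PMID>
  <Article>
    <ArticleTitle>An example title</ArticleTitle>
    <AuthorList>
      <Author><LastName>Smith</LastName><ForeName>Jane</ForeName></Author>
      <Author><CollectiveName>Example Study Group</CollectiveName></Author>
    </AuthorList>
  </Article>
</MedlineCitation>"""


def parse_citation(xml_text):
    """Input side: turn one citation element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    authors = []
    for author in root.iterfind("Article/AuthorList/Author"):
        collective = author.findtext("CollectiveName")
        if collective:
            # Corporate author carried in its own element
            authors.append(collective)
        else:
            last = author.findtext("LastName", default="")
            fore = author.findtext("ForeName", default="")
            authors.append(f"{last}, {fore}".strip(", "))
    return {
        "pmid": root.findtext("PMID"),
        "title": root.findtext("Article/ArticleTitle"),
        "authors": authors,
    }


def to_xml(record):
    """Output side: emit a minimal citation (PMID and title only)."""
    root = ET.Element("MedlineCitation")
    ET.SubElement(root, "PMID").text = record["pmid"]
    article = ET.SubElement(root, "Article")
    ET.SubElement(article, "ArticleTitle").text = record["title"]
    return ET.tostring(root, encoding="unicode")


record = parse_citation(SAMPLE_XML)
regenerated = to_xml(record)
```

A production pipeline would add DTD validation on both sides, which requires a validating parser rather than ElementTree.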