Character Set

U.S. National Library of Medicine
NLMCatalogRecord data in XML Format

About the Character Set used in NLMCatalogRecord:

XML-formatted NLM catalog records issued in the NLMCatalogRecordSet chiefly contain standard Latin characters but may also contain spacing or non-spacing diacritical marks, subscripts, superscripts and other special characters defined for use in MARC 21 records. These may occur in any non-numeric field where called for to accurately record the data. However, the Chinese, Japanese and Korean characters entered using the East Asian Character Code (EACC) in the MARC-8 environment are not currently included in NLMCatalogRecordSet records.

The XML file uses UTF-8 encoding (from ISO/IEC 10646 and the Unicode Standard; see Unicode for more information on Unicode and UTF-8 encoding). The UTF-8 encoded data is in Unicode Normalization Form C (see Unicode Technical Report #15), which uses precomposed (composite) Unicode characters where they are available. This approach is consistent with the direction of the World Wide Web Consortium as described in Character Model for the World Wide Web.
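
As an illustration only, the following sketch shows what Form C plus UTF-8 encoding means for a single accented word. It uses Python's standard unicodedata module; the function name to_nfc_utf8 and the sample string are assumptions made for this example and are not part of any NLM tool or of the NLMCatalogRecordSet files.

```python
import unicodedata


def to_nfc_utf8(text):
    """Normalize text to Form C (NFC), then encode it as UTF-8 bytes."""
    return unicodedata.normalize("NFC", text).encode("utf-8")


# Hypothetical sample: 'e' followed by U+0301 COMBINING ACUTE ACCENT,
# that is, a decomposed spelling of the accented letter.
decomposed = "Me\u0301decine"

encoded = to_nfc_utf8(decomposed)

# In Form C the letter and its accent become the single precomposed
# character U+00E9, which UTF-8 stores as the two bytes C3 A9.
print(encoded)  # b'M\xc3\xa9decine'
```

A consumer of the XML data could apply the same normalization to its own search strings before comparing them with record data, so that both sides use the same form.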

Normalization Form C was adopted for NLMCatalogRecordSet in order to conform with NLM XML records distributed from MEDLINE. Form C differs from the "decomposed" Form D, which is currently defined for expression of MARC 21-formatted records in UTF-8/Unicode. In Form D, the diacritic is encoded AFTER the letter it modifies; for more information see: //www.loc.gov/marc/specifications/speccharintro.html
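
The difference between the two forms can be seen by listing code points. This is a minimal sketch using Python's standard unicodedata module; the helper name code_points and the sample letter are assumptions chosen for illustration, not anything defined by NLM or the MARC 21 specifications.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as a list of 'U+XXXX' strings."""
    return ["U+%04X" % ord(ch) for ch in text]


# Hypothetical sample: LATIN SMALL LETTER E WITH ACUTE.
form_c = unicodedata.normalize("NFC", "\u00e9")
form_d = unicodedata.normalize("NFD", form_c)

# Form C: one precomposed character.
print(code_points(form_c))  # ['U+00E9']

# Form D: the base letter first, then the combining diacritic AFTER it.
print(code_points(form_d))  # ['U+0065', 'U+0301']

# Converting Form D back to Form C restores the precomposed character.
assert unicodedata.normalize("NFC", form_d) == form_c
```

Under these assumptions, data moved between a Form D environment and the Form C XML records would need a normalization step like the one above in each direction.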

Because of the large number of characters which could conceivably occur, NLM will not attempt to provide a complete list of characters possible in NLMCatalogRecordSet.

Last Reviewed: May 11, 2020