MEDLINE® Character Set Expansion
Since the inception of MEDLINE, NLM® has limited the characters used to those typed from a standard US keyboard and a small set of frequently used diacritics (see this character set at Limited MEDLINE®/PubMed® Character Set).
Starting in early September 2010, NLM will accept in newly created MEDLINE records any UTF-8 encoded character in the Latin (Roman) and Greek scripts, as well as mathematical and other symbols commonly found in biomedical literature. Other scripts, such as Chinese, Japanese, or Korean, are not supported (see MEDLINE®/PubMed® Character Set for the expanded character set).
The most notable difference is the addition of Greek characters to the database. Previously, NLM spelled out Greek letters, for example, replacing β (Unicode 03B2) with beta. PubMed users are now able to search for these characters either by copying and pasting the text from an online source or by spelling out the letter as they always have done. Both approaches retrieve the same set of citations.
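As a rough illustration of why both approaches can retrieve the same citations, the following Python sketch reduces a query to a comparison key by spelling out Greek letters. It is not NLM's implementation. The function names `spell_out_greek` and `search_key` are hypothetical, and taking the spelled-out form from the Unicode character name is an assumption made for this example only.

```python
import re
import unicodedata

# Matches Unicode names of plain Greek letters, e.g. "GREEK SMALL LETTER BETA".
# Accented or otherwise modified letters do not match and are left unchanged.
_GREEK_NAME = re.compile(r"^GREEK (?:SMALL|CAPITAL) LETTER (?:FINAL )?([A-Z]+)$")


def spell_out_greek(text: str) -> str:
    """Replace each plain Greek letter with its lowercase spelled-out name."""
    out = []
    for ch in text:
        match = _GREEK_NAME.match(unicodedata.name(ch, ""))
        out.append(match.group(1).lower() if match else ch)
    return "".join(out)


def search_key(query: str) -> str:
    """Hypothetical comparison key: spell out Greek letters, then lowercase."""
    return spell_out_greek(query).lower()


# A pasted Greek letter and its spelled-out form reduce to the same key.
print(search_key("\u03b2-catenin") == search_key("beta-catenin"))
```

Under these assumptions, a query pasted with β (Unicode 03B2) and a query typed with beta are compared on equal terms, which matches the behavior described above.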
NLM will continue to standardize some characters:
- All instances that represent a Double Quote will be translated to the straight double quote " (Unicode 0022).
- All instances that represent a Single Quote (this includes prime and apostrophe) will be translated to the straight single quote ' (Unicode 0027).
- All instances of Em Dash, En Dash, Hyphen, or Minus will be translated to the hyphen-minus - (Unicode 002D).
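The three standardization rules above can be sketched in Python with a translation table. This is a minimal illustration, not NLM's actual procedure. The function name `normalize_punctuation` is hypothetical, and the set of source characters is an assumption; the full list of characters that NLM treats as quotes or dashes is not given here.

```python
# Assumed source characters, each mapped to its standardized replacement.
_PUNCTUATION_MAP = str.maketrans({
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (typographic apostrophe)
    "\u2032": "'",  # prime
    "\u2014": "-",  # em dash
    "\u2013": "-",  # en dash
    "\u2010": "-",  # hyphen
    "\u2212": "-",  # minus sign
})


def normalize_punctuation(text: str) -> str:
    """Translate typographic quotes and dashes to their plain ASCII forms."""
    return text.translate(_PUNCTUATION_MAP)


# Curly quotes, primes, and an en dash all become Unicode 0022, 0027, and 002D.
print(normalize_punctuation("\u201c5\u2032\u20133\u2032 end\u201d"))
```

A table-based translation keeps each rule visible as a single entry, so characters can be added or removed without changing the function.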
See Diacritics in PubMed® Displays and Searching for additional information.
Shore J. MEDLINE® Character Set Expansion. NLM Tech Bull. 2010 Jul-Aug;(375):e13.