Diacritics in PubMed® Displays and Searching
A diacritic is a mark that modifies a letter and indicates a different phonetic value or pronunciation from the unmarked letter, such as the acute accent over the letter e (é) in French. The National Library of Medicine® (NLM®) has always used a certain set of diacritical marks in its journal citation data (see http://www.nlm.nih.gov/databases/dtd/medline_characters.html) and displayed them in print publications such as Index Medicus. Note that the list of marks is limited and that NLM does not use them in combination with capital letters (with the exception of the Danish and Norwegian capital letter Ø and the Polish capital letter Ł). NLM converted to Unicode (UTF-8) encoding for our character set when we transitioned off our mainframe computer to relational database technology around the year 2000; previously we had used an EBCDIC (Extended Binary Coded Decimal Interchange Code) character set.
With the debut of PubMed on the World Wide Web, NLM continued to use diacritics but did not display them by default, because users could be confused about how to search for characters that did not appear on most keyboards in use in the United States. Now the growth of the Web and of international PubMed use, along with the widespread availability of UTF-8 character set printing capabilities, have led NLM to display diacritics in PubMed.
Since late April, when we changed to the new Entrez System (see NCBI to Introduce Changes to the Entrez System — Beta Version Available for Preview. NLM Tech Bull. 2007 Mar-Apr;(355):e7), diacritical marks have been displayed in author names and affiliation (first author's address) on the AbstractPlus, Abstract, and Citation displays (see Figure 1).
Figure 1: Diacritical marks in the Author and author Affiliation fields in the PubMed AbstractPlus display.
Today, diacritics were added to the Summary display (see Figure 2) for new citations; by next week they should appear in all citations for which diacritics are available. The XML display option has always shown the diacritical marks. The MEDLINE display will not show diacritics, as it has historically been a straight ASCII (American Standard Code for Information Interchange) presentation limited to 128 characters.
PubMed pages set Unicode (UTF-8) as the default character encoding for optimal viewing of diacritical marks.
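The difference between a 128-character ASCII presentation and UTF-8 can be illustrated with a short Python sketch. The function names fits_in_ascii and utf8_length are hypothetical and chosen only for this illustration; the sketch does not describe how PubMed produces any of its displays. It only shows that a letter with a diacritic falls outside ASCII while UTF-8 can still represent it.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character is one of the 128 ASCII characters."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True


def utf8_length(text: str) -> int:
    """Return the number of bytes needed to store the text in UTF-8."""
    return len(text.encode("utf-8"))


if __name__ == "__main__":
    # "\u00e9" is e with an acute accent; "\u0141" is the Polish capital L with stroke.
    for sample in ("e", "\u00e9", "\u0141"):
        print(repr(sample), fits_in_ascii(sample), utf8_length(sample))
```

Plain "e" fits in ASCII and takes one byte in UTF-8, whereas the accented and stroked letters do not fit in ASCII and take two bytes each in UTF-8.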
In general, most diacritical marks appear in the author name, affiliation, and Transliterated/Vernacular Title fields, with some marks occurring in the Article Title, Abstract, Personal Name as Subject, or Full Journal Title fields. (Note: The Full Journal Title field may contain characters not in the MEDLINE character set because this element is derived from Voyager, the NLM Integrated Library System, which has a larger character set.)
Please note that diacritical marks that did not convert successfully to Unicode display as an inverted question mark (¿). As time and resources permit, these will be corrected.
PubMed searching ignores diacritical marks in all terms, even if users enter the marks in a search query box (by cutting and pasting or by direct entry). A search that includes diacritics therefore retrieves terms with the diacritic as well as terms without it, and a search entered in plain letters retrieves the same two sets. In other words, search results are "diacritics-neutral" (see Figure 3).
Searching uses the plain letter equivalent whether the query is user-entered or system-generated, such as the author name search links that are launched by clicking on an author name from most displays (see Figure 4).
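The idea of a "plain letter equivalent" can be sketched in Python with the standard unicodedata module. This is only an illustration of diacritics-neutral matching under stated assumptions, not PubMed's actual implementation: the names fold_diacritics and diacritics_neutral_match are hypothetical, and the small table for Ø and Ł is an assumption added because those letters have no Unicode decomposition into a base letter plus a combining mark.

```python
import unicodedata

# Assumed, illustrative mapping for letters that NFD normalization leaves
# unchanged because they have no canonical decomposition.
_NO_DECOMPOSITION = str.maketrans({"Ø": "O", "ø": "o", "Ł": "L", "ł": "l"})


def fold_diacritics(text: str) -> str:
    """Return the text with diacritical marks removed."""
    # NFD splits a letter such as e-acute into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Handle the stroked letters that NFD does not decompose.
    return stripped.translate(_NO_DECOMPOSITION)


def diacritics_neutral_match(query: str, term: str) -> bool:
    """Compare a query and an indexed term by their plain letter equivalents."""
    return fold_diacritics(query) == fold_diacritics(term)


if __name__ == "__main__":
    print(fold_diacritics("M\u00fcller"))                      # u with umlaut
    print(diacritics_neutral_match("Jos\u00e9", "Jose"))       # query has the mark
    print(diacritics_neutral_match("Jose", "Jos\u00e9"))       # term has the mark
```

Because both the query and the term are folded before comparison, a query typed with or without the marks matches the same terms, which is the behavior the article calls diacritics-neutral.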
Knecht LWS, Canese K. Diacritics in PubMed® Displays and Searching. NLM Tech Bull. 2007 Nov-Dec;(359):e4.