AUTOMATED ASSIGNMENT OF MEDICAL SUBJECT HEADINGS

Stuart J. Nelson, MD, Alan R. Aronson, PhD, Tamas E. Doszkocs, PhD, W. John Wilbur, MD, PhD,
Olivier Bodenreider, MD, PhD, H. Florence Chang, MS, James Mork, MS, and Alexa T. McCray, PhD
National Library of Medicine, Bethesda, MD
 
Introduction
As part of the National Library of Medicine's Indexing Initiative, we developed and compared automated methods of assigning Medical Subject Headings (MeSH) to citations, using a test collection of 200 randomly selected MEDLINE citations published in 1997.

Methods
The following methods of finding and ranking suitable MeSH descriptors have been investigated using this test collection:
Parsers. Phrasex [1], a barrier word method [2], and the MIT parser were all used to extract noun phrases from the titles and English abstracts of the citations.
INQUERY Algorithm. This algorithm uses the INQUERY search engine [3] to match extracted noun phrases to MeSH descriptors. Co-occurring MeSH descriptors in the UMLS are used to suggest additional headings.
MetaMap. MetaMap [1] develops an ordered list of UMLS Metathesaurus concepts for each citation, based on the noun phrases extracted from that text. A ranked list of concepts is developed for each phrase.
Approximate Matching. This version of MetaMap uses less restrictive rules in mapping noun phrases to UMLS Metathesaurus concepts.
Trigram Algorithm. A phrase is broken into overlapping trigrams (three letters occurring in succession) for analysis. Candidate phrases are obtained from the title and abstract by examining all maximal contiguous sets of words that contain no punctuation or stop words (from a list of 310 common stop words). The trigrams are used to match phrases in the UMLS, with the maximal overlap of sets of trigrams resulting in the suggested UMLS concept.
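The trigram matching step can be illustrated with a minimal sketch. The short stop word list, the function names, and the use of a Dice coefficient to measure trigram overlap are assumptions made for illustration; the abstract does not specify the scoring function, and the actual list contained 310 stop words.

```python
import re

# Hypothetical, tiny stop word list; the list described in the text had 310 entries.
STOP_WORDS = {"the", "of", "in", "and", "a", "with", "for", "to", "is"}


def trigrams(phrase):
    """Return the set of overlapping three-character sequences in a phrase."""
    text = phrase.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def candidate_phrases(text):
    """Return maximal runs of words that contain no punctuation or stop words."""
    phrases = []
    # Punctuation ends a run, so split on it first (hyphens are kept).
    for segment in re.split(r"[^\w\s-]+", text):
        run = []
        for word in segment.split():
            if word.lower() in STOP_WORDS:
                if run:
                    phrases.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.append(" ".join(run))
    return phrases


def best_concept(phrase, concept_names):
    """Pick the concept name with the largest trigram overlap (Dice, an assumption)."""
    grams = trigrams(phrase)

    def dice(name):
        other = trigrams(name)
        if not grams or not other:
            return 0.0
        return 2 * len(grams & other) / (len(grams) + len(other))

    return max(concept_names, key=dice)
```

Because matching operates on letter sequences rather than whole words, a phrase such as "myocardial infarctions" can still match a concept named "Myocardial Infarction" despite the plural ending.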
Restrict to MeSH. Once a UMLS concept has been identified, the task becomes one of navigating within the UMLS to find the appropriate MeSH heading. This method has been described previously [4].
PubMed Related Citations Method. This method depends on the assumption that the semantic neighbors of a document are those documents in the database that are the most similar to it [5]. The similarity between documents is measured by the words they have in common, with some adjustment for document length. The test document is used as the basis for finding similar documents. MeSH descriptors assigned to those similar documents are then suggested for the test document.
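The neighbor-based idea can be sketched as follows. This is not the PubMed related citations algorithm itself: cosine similarity over raw word counts stands in for the similarity measure, and the data structure, the parameter `k`, and the function names are hypothetical.

```python
import math
from collections import Counter


def similarity(words_a, words_b):
    """Cosine similarity over word counts; the norms adjust for document length."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_from_neighbors(test_words, indexed_docs, k=2):
    """Suggest descriptors from the k indexed documents most similar to the test document.

    indexed_docs is a list of (words, mesh_descriptors) pairs (hypothetical structure).
    Each descriptor is scored by the summed similarity of the neighbors that carry it.
    """
    neighbors = sorted(
        ((similarity(test_words, words), mesh) for words, mesh in indexed_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter()
    for sim, mesh in neighbors:
        for descriptor in mesh:
            votes[descriptor] += sim
    return [descriptor for descriptor, _ in votes.most_common()]
```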
Clustering of Suggested Headings. After using one or more of the above methods, the suggested MeSH headings are clustered. Descriptors close together in the same MeSH trees are given additional weight, as are the descriptors known to co-occur with high frequency in MEDLINE. The suggested headings are then presented in rank order.
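A minimal sketch of this kind of re-ranking is shown below. It assumes that each descriptor has a single dotted tree number and that two descriptors are "close" when they share a parent node; the bonus weight, the function name, and the tree numbers in the usage example are made up for illustration, and the co-occurrence weighting is omitted.

```python
from collections import Counter


def rank_headings(suggestions_by_method, tree_numbers, tree_bonus=0.5):
    """Rank descriptors by how often they were suggested, plus a tree-proximity bonus.

    suggestions_by_method: dict mapping a method name to its list of descriptors.
    tree_numbers: dict mapping a descriptor to a dotted tree number string.
    tree_bonus: hypothetical weight added for each suggested sibling in the tree.
    """
    counts = Counter()
    for descriptors in suggestions_by_method.values():
        counts.update(set(descriptors))  # each method votes once per descriptor

    def parent(descriptor):
        number = tree_numbers.get(descriptor)
        return number.rsplit(".", 1)[0] if number else None

    scores = {}
    for d in counts:
        siblings = sum(
            1
            for other in counts
            if other != d and parent(d) is not None and parent(other) == parent(d)
        )
        scores[d] = counts[d] + tree_bonus * siblings
    # Highest score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Usage with made-up tree numbers (not taken from MeSH):
suggestions = {
    "metamap": ["Myocardial Infarction", "Aspirin"],
    "trigram": ["Myocardial Infarction", "Angina Pectoris"],
    "related": ["Myocardial Infarction", "Aspirin", "Mortality"],
}
trees = {
    "Myocardial Infarction": "X01.100.200",
    "Angina Pectoris": "X01.100.300",
    "Aspirin": "Y02.500.100",
    "Mortality": "Z03.400",
}
ranked = rank_headings(suggestions, trees)
```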
[Figure: graphic of process]
Testing
A web-based interface was developed to allow the testing of each of these paths either singly or in combination. Testers were also allowed to vary parameters affecting the weighting given to suggested MeSH headings. Parameters altering the combination of suggested headings from different paths (“clustering”) could also be adjusted. Testing could be done interactively, with the opportunity to view the results of any alteration, or in batch mode. Testing in batch mode allowed calculation of average precision and recall over the test set of 200 citations using a single set of parameters.
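The batch calculation can be sketched as below, assuming that "average precision and recall" means the per-citation precision and recall averaged over the test set, with the descriptors assigned by human indexers as the reference. The function names and data layout are hypothetical.

```python
def precision_recall(suggested, assigned):
    """Precision and recall of suggested descriptors against the indexer-assigned set."""
    suggested, assigned = set(suggested), set(assigned)
    hits = len(suggested & assigned)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(assigned) if assigned else 0.0
    return precision, recall


def batch_average(results):
    """Average precision and recall over (suggested, assigned) pairs, one per citation."""
    pairs = [precision_recall(s, a) for s, a in results]
    if not pairs:
        return 0.0, 0.0
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

For a ranked list of suggestions, `suggested` would typically be truncated to the top few headings before scoring, so that the cutoff becomes one more parameter to vary.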

Results and Conclusions
A formal trial of the methods has not yet been completed. However, several observations can be made. Use of different parsers to extract noun phrases from the title and abstract did not appear to alter performance significantly. In clustering and weighting the suggested headings, the most important aspect appeared to be the number of times a given descriptor was suggested. Second in importance were the semantic relationships between descriptors. The numerical values of the weighting factors had little effect. It appears that methods based on natural language processing and mapping to MeSH are complementary to the related citations method, and that any system should therefore use a combination of those methods.

References

1. Aronson AR, Rindflesch TC, Browne AC. Exploiting a large thesaurus for information retrieval. Proceedings of RIAO 94, 197-216, 1994.

2. Nelson SJ, Olson NE, Fuller LF, Tuttle MS, Cole WG, Sherertz DD. Identifying Concepts in Medical Knowledge. MEDINFO 95, 33-6, 1995.

3. Callan JP, Croft WB, Harding SM. The INQUERY Retrieval System. Proceedings of the 3rd International Conference on Database and Expert Systems Applications, 78-83, 1992.

4. Bodenreider O, Nelson SJ, Hole WT, Chang HF. Beyond Synonymy: Exploiting the UMLS Semantics in Mapping Vocabularies. J Am Med Informatics Assoc (Symposium Suppl), 815-9, 1998.

5. Wilbur WJ, Coffee L. The effectiveness of document neighboring in search enhancement. Information Processing & Management, 30(2):253-266, 1994.
Last updated: 20 November 2001