AUTOMATED ASSIGNMENT OF MEDICAL SUBJECT HEADINGS
Stuart J. Nelson, MD, Alan R. Aronson, PhD, Tamas E. Doszkocs, PhD,
W. John Wilbur, MD, PhD, Olivier Bodenreider, MD, PhD,
H. Florence Chang, MS, James Mork, MS, and Alexa T. McCray, PhD
National Library of Medicine, Bethesda, MD
Introduction
As part of the National Library of Medicine's Indexing Initiative, we
developed automated methods of assigning Medical Subject Headings (MeSH) to
citations and compared them on a test collection of 200 randomly selected
MEDLINE citations published in 1997.
Methods
The following methods of finding and ranking suitable MeSH descriptors have
been investigated using this test collection:

Parsers. Phrasex [1], a barrier word method [2], and the MIT parser were all
used to extract noun phrases from the titles and English abstracts of the
citations.

INQUERY Algorithm. This algorithm uses the INQUERY search engine [3] to match
extracted noun phrases to MeSH descriptors. Co-occurring MeSH descriptors in
the UMLS are used to suggest additional headings.

MetaMap. MetaMap [1] develops an ordered list of UMLS Metathesaurus concepts
for each citation, based on the noun phrases extracted from that text. A
ranked list of concepts is developed for each phrase.

Approximate Matching. This version of MetaMap uses less restrictive rules in
mapping noun phrases to UMLS Metathesaurus concepts.

Trigram Algorithm. A phrase is broken into overlapping trigrams (three letters
occurring in succession) for analysis. Candidate phrases are obtained from the
title and abstract by examining all maximal contiguous sets of words that
contain no punctuation or stop words (from a list of 310 common stop words).
The trigrams are used to match phrases in the UMLS, with the maximal overlap
of sets of trigrams resulting in the suggested UMLS concept.

Restrict to MeSH. Once a UMLS concept has been identified, the task becomes
one of navigating within the UMLS to find the appropriate MeSH heading. This
method was described previously [4].

PubMed Related Citations Method. This method depends on the assumption that
the semantic neighbors of a document are those documents in the database that
are the most similar to it [5]. The similarity between documents is measured
by the words they have in common, with some adjustment for document lengths.
The test document is used as the basis for finding similar documents. MeSH
descriptors assigned to similar documents are then assigned to the test
document.

Clustering of Suggested Headings. After using one or more of the above
methods, the suggested MeSH headings are clustered. Descriptors close together
in the same MeSH trees are given additional weight, as are the descriptors
known to co-occur with high frequency in MEDLINE. The suggested headings are
then presented in rank order.
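
The Trigram Algorithm described above can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the stop list is a tiny stand-in for the 310-word list, the vocabulary passed to best_match is a hypothetical list of terms rather than the UMLS, and the Dice coefficient is an assumed overlap measure, since the exact measure is not specified here.

```python
import re

# Tiny illustrative stop list (assumption); the real list has 310 words.
STOP_WORDS = {"a", "an", "and", "in", "of", "on", "the", "to", "with"}


def candidate_phrases(text):
    """Return maximal runs of consecutive words with no punctuation or stop words."""
    phrases = []
    # Split on punctuation first so that no phrase crosses a punctuation mark.
    for segment in re.split(r"[^\w\s]+", text.lower()):
        run = []
        for word in segment.split():
            if word in STOP_WORDS:
                if run:
                    phrases.append(" ".join(run))
                run = []
            else:
                run.append(word)
        if run:
            phrases.append(" ".join(run))
    return phrases


def trigrams(phrase):
    """Return the set of overlapping three-character substrings of a phrase."""
    text = phrase.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}


def best_match(phrase, vocabulary):
    """Pick the vocabulary term whose trigram set overlaps most with the phrase's."""
    target = trigrams(phrase)

    def dice(term):
        # Dice coefficient on trigram sets (an assumed overlap measure).
        other = trigrams(term)
        if not target or not other:
            return 0.0
        return 2 * len(target & other) / (len(target) + len(other))

    return max(vocabulary, key=dice)
```

Because matching is done on letter trigrams rather than whole words, a candidate such as "myocardial infarctions" still overlaps almost completely with the term "Myocardial Infarction", which is the kind of tolerance to inflection that this style of matching offers.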
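
The idea behind the PubMed Related Citations Method can be sketched in a similar way. The sketch below assumes plain cosine similarity over word counts as a stand-in for the similarity measure of [5], whose actual term weighting is not reproduced here. The function names, the parameter k, and the scoring of descriptors by summed neighbor similarity are illustrative assumptions.

```python
import math
from collections import Counter


def word_counts(text):
    """Count lower-cased whitespace-separated words in a text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Similarity from shared words, normalised to adjust for document length."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_from_neighbors(test_text, indexed_docs, k=2):
    """Rank descriptors of the k most similar documents by summed similarity.

    indexed_docs is a list of (text, descriptors) pairs for documents that
    already carry MeSH descriptors.
    """
    query = word_counts(test_text)
    scored = sorted(
        ((cosine(query, word_counts(text)), descriptors)
         for text, descriptors in indexed_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    totals = Counter()
    for score, descriptors in scored[:k]:
        for descriptor in descriptors:
            totals[descriptor] += score
    return [descriptor for descriptor, _ in totals.most_common()]
```

Note that this route never parses the test document into noun phrases; its suggestions come entirely from descriptors already assigned to neighboring documents.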
Testing
A web-based interface was developed to allow the testing of each of these
paths either singly or in combination. Testers were also allowed to vary
parameters affecting the weighting given to suggested MeSH headings.
Parameters altering the combination of suggested headings from different
paths (clustering) could also be adjusted. Testing could be done
interactively, with the opportunity to view the results of any alteration, or
in batch mode. Batch mode allowed calculation of average precision and recall
over the test set of 200 citations using a single set of parameters.
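
The batch calculation can be sketched as follows. This is a minimal illustration that assumes precision and recall are computed per citation against a reference set of headings and then averaged over citations. The source of the reference headings and the function names are assumptions, not details given above.

```python
def precision_recall(suggested, reference):
    """Precision and recall of suggested headings against reference headings."""
    suggested, reference = set(suggested), set(reference)
    hits = len(suggested & reference)
    precision = hits / len(suggested) if suggested else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall


def batch_average(results):
    """Average per-citation precision and recall over a test set.

    results is an iterable of (suggested, reference) pairs, one per citation.
    """
    pairs = [precision_recall(s, r) for s, r in results]
    if not pairs:
        return 0.0, 0.0
    n = len(pairs)
    return sum(p for p, _ in pairs) / n, sum(r for _, r in pairs) / n
```

Averaging per citation gives every citation equal weight regardless of how many headings it carries; pooling all headings before dividing would be a different design choice.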
Results and Conclusions
A formal trial of the methods has not yet been completed. However, several
observations can be made. The choice of parser used to extract noun phrases
from title and abstract did not appear to alter performance significantly. In
clustering and weighting the suggested headings, the most important factor
appeared to be the number of times a given descriptor was suggested. Second in
importance were the semantic relationships between descriptors. The numerical
values of the weighting factors had little effect. It appears that methods
based on natural language processing and mapping to MeSH are complementary to
the PubMed Related Citations method, and that any system should therefore use
a combination of those methods.
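
One simple reading of the observation that suggestion count mattered most is a vote across methods. The sketch below is an illustration under that assumption only: it ignores the semantic relationships between descriptors and any weighting factors, and the method names in the usage example are hypothetical labels.

```python
from collections import Counter


def combine_by_votes(suggestions_by_method):
    """Rank descriptors by how many methods suggested them.

    suggestions_by_method maps a method name to its list of suggested
    descriptors.
    """
    votes = Counter()
    for descriptors in suggestions_by_method.values():
        # Each method votes at most once per descriptor.
        votes.update(set(descriptors))
    # Descending vote count, with ties broken alphabetically for a stable order.
    return sorted(votes, key=lambda d: (-votes[d], d))


ranked = combine_by_votes({
    "metamap": ["Aspirin", "Myocardial Infarction"],
    "trigram": ["Aspirin", "Heart"],
    "related": ["Aspirin", "Myocardial Infarction", "Child"],
})
```

In this toy example the descriptor suggested by all three methods ranks first, followed by the one suggested by two.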
References
1. Aronson AR, Rindflesch TC, Browne AC. Exploiting a large thesaurus for information retrieval. Proceedings of RIAO 94, 197-216, 1994.
2. Nelson SJ, Olson NE, Fuller LF, Tuttle MS, Cole WG, Sherertz DD. Identifying Concepts in Medical Knowledge. MEDINFO 95, 33-6, 1995.
3. Callan JP, Croft WB, Harding SM. The INQUERY Retrieval System. Proceedings of the 3rd International Conference on Database and Expert Systems Applications, 78-83, 1992.
4. Bodenreider O, Nelson SJ, Hole WT, Chang HF. Beyond Synonymy: Exploiting the UMLS Semantics in Mapping Vocabularies. J Am Med Informatics Assoc (Symposium Suppl), 815-9, 1998.
5. Wilbur WJ, Coffee L. The effectiveness of document neighboring in search enhancement. Information Processing & Management, 30(2):253-266, 1994.