History of Medicine
About IndexCat™ Project
The digitization of the Index-Catalogue of the Library of the Surgeon-General’s Office (Index-Catalogue) was a collaborative project conceived by the American Association for the History of Medicine (AAHM) through its ad hoc Committee on Electronic Media (COEM). The Wellcome Trust and the Burroughs Wellcome Fund provided early support for a pilot project. The National Library of Medicine (NLM) funded the digitization phase and hosts IndexCat™ – the digitized name – at its Web site on the Internet: http://indexcat.nlm.nih.gov.
Need for Digitization
The COEM identified the need for digitization of the Index-Catalogue, citing it as “an important research tool … heavily used by a broad range of scholars in biomedicine, general sciences and the humanities.” Manual searching of its sixty-one (61) printed volumes is a tedious task that often results in missed relevant citations. A citation listed under a subject heading in one series may have its author and title references in another series. The printed Index-Catalogue lists authors of journal articles only under subject headings; these names are not part of the primary dictionary arrangement and are not directly accessible.
The Index-Catalogue is also out-of-print. Most printed sets are available in North America and Western Europe. Printed access in other geographic areas is severely limited. Where printed sets exist, they are deteriorating because of use and age.
Digitization of the Index-Catalogue makes it a free, worldwide resource, accessible through the Internet.
In 1996, a sub-committee of COEM submitted proposals for funds to support prototype projects. In March 1997, The Wellcome Trust provided $100,000 to test digitization feasibility and prototype approaches. The Burroughs Wellcome Fund gave $75,000 to support the overall project.
After competitive bid and review, the COEM funded two proposed prototypes: one from AND-USA, Inc. (AND Inc.) and the other from ATLIS. The Committee received comments about each prototype from a broad range of users. In addition to AAHM members and research colleagues, NLM staff in the Bibliographic Services Division and the History of Medicine Division tested the prototypes and commented on them.
On July 22, 1998, the COEM sub-committee officially presented both prototypes to the National Library of Medicine. The COEM asked the NLM to join the project and assist in funding the digitization phase. After further informal discussions, the NLM agreed to join the digitization effort.
In September 1999, the National Library of Medicine awarded a contract for digitization of the complete printed Index-Catalogue, comprising five (5) series in a total of sixty-one (61) volumes. The estimated total number of bibliographic citations was over three (3) million.
The NLM contract, awarded through Drexel University, authorized a two-year digitization period. The principal investigator was Russell Maulitz (M.D., Ph.D.) of Drexel University, who was also chair of the COEM. The sub-contractor was AND Inc., working through its Baltimore office, which is now closed. Lillian R. Kozuma from the History of Medicine Division served as the NLM project officer.
Digitization of the Index-Catalogue did not use optical character recognition (OCR). The process combined traditional triple manual keying with computer validation routines for content and SGML (Standard Generalized Markup Language) formatting. (NLM XML (Extensible Markup Language) format standards were still in flux at that time.)
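The source does not describe how the three keyings were reconciled, so the following is only a minimal sketch of one common approach: a character-by-character majority vote that flags positions where no two keyings agree. The function name `reconcile` and the sample strings are hypothetical, and comparing by position is a simplification that would not cope well with inserted or dropped characters.

```python
from collections import Counter
from itertools import zip_longest


def reconcile(keyings):
    """Merge independent keyings of one record by per-character majority vote.

    Hypothetical illustration, not the project's actual routine.
    Returns (merged_text, flagged_positions). A position is flagged when
    no character was keyed identically by at least two operators.
    """
    merged = []
    flagged = []
    # Pad shorter keyings with "" so every position has one vote per keying.
    for pos, chars in enumerate(zip_longest(*keyings, fillvalue="")):
        char, votes = Counter(chars).most_common(1)[0]
        if votes < 2:
            flagged.append(pos)  # needs a human look
        merged.append(char)
    return "".join(merged), flagged


if __name__ == "__main__":
    text, doubts = reconcile(["Paris, 1880", "Paris, 1880", "Faris, 1880"])
    print(text, doubts)
```

A single operator's slip is outvoted by the other two keyings, while a position where all three disagree is left for manual review rather than silently resolved.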
The prototype projects rejected OCR as a viable technique. Index-Catalogue incorporates multiple font types, mixing italic, bold, regular, and capitals in all type sizes. At least a dozen different type fonts exist. To reduce original printing costs, the fonts used are small. The typeface for 2.5 million journal article references is the smallest size, less than 7 points. These citations print in a block with no line separations between them. In addition, Index-Catalogue includes citations in more than a dozen languages with diacritical marks. Greek, Russian, and Hebrew scripts are also included. These factors, combined with a requirement for 99% accuracy in character translation, ruled out OCR.
Manual keying converted more than 3.7 million citations, headings, and references during the two-year digitization period. At an average of about 100 characters per record, this amounts to roughly 370 million characters of converted text.
AND Inc., the project’s sub-contractor, managed the actual keying process. They were responsible for adhering to and applying the quality control factors that maintained contract standards. Their chief engineer, Eric Grivel, developed initial data structures and definitions, and continuously refined validation routines for quality control during the digitization process.
A joint Quality Assurance committee reviewed digitized output. The committee membership included the Principal Investigator as Chair, the NLM project officer, and members representing AAHM and Drexel University. AND Inc. staff attended on an ex-officio basis. Monthly committee meetings assessed and discussed the previous month’s delivery. The first review revealed significant errors and resulted in rejection of the first batch. The sub-contractor provided additional training and instruction and redeveloped its computer routines to improve quality. The contract quality control standard required accuracy for 99 out of 100 keyed characters.
The committee selected volumes for review and discussed the results of detailed reviews of these volumes, each based upon a 1% sample of the volume. A random number generator on the Web selected the volume page numbers for the review cycle. Reviewers looked at full pages rather than scattered individual citations in order to review keying, sequencing, and coding of Index-Catalogue’s dictionary arrangement. If an entire page had format problems, reviewers examined surrounding pages. This method detected generic as well as individual problems.
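The review procedure described above can be illustrated with a short sketch. It is not the committee's actual tooling: the function names, the use of Python's `random` module in place of the Web-based generator, and the rounding of the 1% sample up to a whole page are all assumptions made for illustration.

```python
import math
import random


def sample_pages(total_pages, percent=1, seed=None):
    """Pick a random sample of whole pages from one volume (hypothetical helper).

    Rounds the sample size up so that even a short volume gets one page.
    """
    rng = random.Random(seed)
    count = max(1, math.ceil(total_pages * percent / 100))
    return sorted(rng.sample(range(1, total_pages + 1), count))


def meets_standard(chars_checked, chars_wrong, allowed_errors_per_100=1):
    """Check a reviewed sample against a '99 out of 100 characters' standard.

    Uses integer arithmetic to avoid floating-point edge cases at exactly 99%.
    """
    if chars_checked <= 0:
        raise ValueError("chars_checked must be positive")
    return chars_wrong * 100 <= chars_checked * allowed_errors_per_100


if __name__ == "__main__":
    pages = sample_pages(total_pages=1000, seed=42)
    print("Pages to review:", pages)
    print("Batch acceptable:", meets_standard(chars_checked=50000, chars_wrong=420))
```

Sampling whole pages rather than individual citations matches the reviewers' stated aim of checking sequencing and coding as well as keying.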
The committee determined if batch quality was acceptable for payment and discussed error corrections through programming changes. Computer routines corrected known errors both for the existing database and for newly keyed citations. These enhancements expanded the usefulness of review information since it was possible to correct large groups of citations by computer.
Language assignment proved to be a problem. Because language is not explicitly stated in the printed catalog, a computer routine assigned it by comparing each citation’s place of publication against a list of locations and their languages. This routine was not accurate because the same or similar place names occur in several countries with different languages. The quality review committee therefore recommended delaying language searching until substantial corrections are possible. NLM is trying to determine a computer procedure for corrections, since individual citation review is costly.
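To show why a place-name lookup is error-prone, here is a minimal sketch under stated assumptions. The lookup table `PLACE_LANGUAGES` and the function `assign_language` are hypothetical, and the entries are illustrative only, not drawn from the project's location list. Unlike the routine described above, this sketch declines to assign a language when a place name maps to more than one candidate.

```python
# Hypothetical, illustrative location list: place name -> candidate languages.
PLACE_LANGUAGES = {
    "paris": {"French"},
    "leipzig": {"German"},
    "athens": {"Greek", "English"},  # Athens, Greece vs. Athens, Georgia (USA)
    "san jose": {"Spanish", "English"},  # Costa Rica vs. California
}


def assign_language(place_of_publication):
    """Return a language only when the place name is unambiguous.

    Returns None for unknown or ambiguous places so that those citations
    can be routed to a later correction step instead of being mislabeled.
    """
    key = place_of_publication.strip().lower()
    candidates = PLACE_LANGUAGES.get(key, set())
    if len(candidates) == 1:
        return next(iter(candidates))
    return None


if __name__ == "__main__":
    for place in ["Paris", "Athens", "Boston"]:
        print(place, "->", assign_language(place))
```

Returning no value for ambiguous names trades coverage for accuracy; a fuller approach would need additional evidence, such as the language of the title itself.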
NLM received a final digitized copy of Index-Catalogue in December, 2001 in SGML format and converted it into XML.
In late December 2001, the NLM purchased the ENCompass software as the platform for delivering IndexCat™ to the Internet. ENCompass is a product of Endeavor Information Systems Inc. (now known as Ex Libris Group).
In May 2002, Endeavor’s technical consultants met with NLM staff to determine the Document Type Definition (DTD) structure for loading the XML data into ENCompass. We are using the Encoded Archival Description (EAD) metadata format.
Conversion for each Index-Catalogue series was a separate load using Endeavor’s metaloader via Endeavor-designed scripts. IndexCat™ data is very complex, and script analysis, modifications, testing, and data reviews took two years before public release. Series 2 through 4 became available on the Web on May 1, 2004 at the AAHM meeting in Madison, Wisconsin. Series 5 was added on May 14, 2004 and Series 1 on June 18, 2004.