The Year 2000 Solution for ELHILL and the MEDLARS Databases. Sep-Oct 1998. NLM Technical Bulletin

[Editor's Note: This article is a technical presentation of the implementation of Year 2000 compliance for NLM's ELHILL databases. Please see the Year-End Processing article in the September-October 1997 NLM Technical Bulletin for search hints.]

In the Spring of 1997, the NLM Information Retrieval System (IRS) ELHILL was providing access to approximately 35 databases of 20 million citations and about 40 gigabytes of disk storage. These data came from a variety of sources, both internal and external to the NLM, and were processed through standard MEDLARS programs and individualized conversion programs.

The Office of Computer and Communications Systems (OCCS) was tasked to make all the computer systems, both hardware and software, Year 2000 compliant, as mandated by law. With the current retrieval system expected to be replaced in another 1 1/2 to 2 years, it became necessary to find a solution that would not take that much time to implement.

There are basically four types of date fields in MEDLARS, as follows:

a): two-digit fields consisting of just the last two characters of the year, e.g., Year (YY) --- '98'
b): four-digit fields consisting of the last two characters of the year and the relative month, e.g., Entry Month (YYMM) --- '9805' --- May 1998
c): six-digit fields consisting of the last two characters of the year, the relative month, and the day, e.g., Date of Entry (YYMMDD) --- '980529' --- May 29, 1998
d): four-digit (or longer) fields beginning with the full representation of the year, including the Century (CC), e.g., Date of Publication (CCYY......) --- '1998', '1998 May 29', or '1998 Spring', etc.

Only the first three needed to be adjusted for both retrieval and display to the user; the fourth was already Year-2000 compliant. One additional factor had to be addressed: Ranging. ELHILL allows ranging in the form of 'less than x', 'greater than y', and 'from x to y', where 'x' and 'y' represent whole numbers. The overwhelming majority of ranging in the ELHILL IRS is on dates of the forms (a), (b), and (c), as shown above. Clearly, a ranging operation using 'from 99 to 01' would be illegal as the upper bound is less than the lower bound. Therefore, in addition to direct searching and display, numeric ranging would have to be addressed.
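The ranging problem can be illustrated with a minimal sketch. The function name `in_range` and its error handling are hypothetical; the source does not describe how ELHILL evaluates ranges beyond the three forms listed above. The sketch only shows why a 'from x to y' range on two-digit years breaks across the century boundary, and why the same range is legal once the century is included.

```python
def in_range(value: int, low: int, high: int) -> bool:
    """Inclusive 'from low to high' test on whole numbers (hypothetical helper)."""
    if low > high:
        # Mirrors the rule that the upper bound may not be below the lower bound.
        raise ValueError(f"illegal range: lower bound {low} exceeds upper bound {high}")
    return low <= value <= high


# Two-digit years: 'from 99 to 01' is rejected outright.
try:
    in_range(0, 99, 1)
    two_digit_range_ok = True
except ValueError:
    two_digit_range_ok = False

# With the century included, the equivalent range is legal and matches 2000.
four_digit_range_ok = in_range(2000, 1999, 2001)
```

The 'less than' and 'greater than' forms have the same weakness: on two-digit years, 'greater than 98' would include '99' but exclude '00' and '01'.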

The main aim of the solution was to avoid changing the data in the citation while giving the user the appearance that the data had been changed. Because almost all of the data in the MEDLARS databases were published from the 1960s onward, MEDLARS offers a special case that might not be available to other systems. With the exception of a special presentation rule for display (printing) in the ELHILL IRS, all the necessary changes could be made in the File Generation and Maintenance (FGM) job stream(s) which build and maintain the databases.

The FGM Subsystem of MEDLARS is composed of a series of programs, sorts, and merges, which process new and maintained citations and:

a): build and maintain the citation itself, creating two sequential files of intermediate index points and ranging points,
b): add enrichment data such as Medical Subject Heading (Trees), Pre-Explosions, etc., to the intermediate indexing points,
c): merge the ranging values and (now-enriched) index points with the existing indexes and ranging file(s) to complete the building process.

The solution consists of a single sort and a new program to run between steps (a) and (b) immediately above, as follows:

a): the sort, driven by specification, would create a duplicate file of index points for those fields to be enriched with the Century (CC) representation. Note that the original fields without the Century representation are untouched, still allowing the user to search without the Century representation. As the index points, whether numeric or not, are represented by characters, not binary or decimal, the first character is examined: if it is a '0', a '20' is prefixed to the original data; if it is not a '0', a '19' is prefixed. This change, good through 2009 (well after the current system is to be replaced), allows the user to search '1998' as both '98' and '1998', and allows '01' to be searched as both '01' and '2001'. These updated index points are merged back with the intermediate index points and no other changes are necessary.
b): the same program adjusts the ranging values, again driven by specification. Ranging presented a more difficult problem because the numeric data are represented as four-character binary fields, not as characters. It was therefore necessary to know whether the intermediate ranging data to be adjusted originated from a two-, four-, or six-character field; this was supplied in the specifications. The binary value in the field to be adjusted was compared to 10**(n-1), where 'n' is the number of digits in the original field (2, 4, or 6). If the value was less than 10**(n-1), then 20*10**n was added to the field; if greater than or equal, then 19*10**n was added. For example, assume we are adjusting a two-character Year field containing the binary representation of '98'. '98' is greater than 10**(2-1), or 10, so 19*10**2 (1900) is added to '98', making it '1998'. If the value were '01', it would be found to be less than 10**(2-1), and 20*10**2 (2000) would be added to '01', making it '2001'. The same holds true for values that originated from four- or six-digit fields. These updated ranging points replace the original file of ranging points, and no other changes are necessary.
c): a print rule was written for ELHILL (there are over 30 of these presentation rules) which adjusts the field using the same logic as described in (a) above; a '19' or a '20' is prefixed to the specified fields for presentation, depending on the first character.
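The two rules in steps (a) through (c) above can be sketched as follows. This is an illustration under stated assumptions, not the NLM code: the function names `add_century_to_text` and `add_century_to_number` are hypothetical, and the input validation is added here for safety rather than taken from the article. The first function follows the character rule used for index points and the print rule; the second follows the numeric rule used for the binary ranging values.

```python
def add_century_to_text(field: str) -> str:
    """Character rule from steps (a) and (c): prefix '20' if the first
    character is '0', otherwise prefix '19'. Hypothetical helper name."""
    if not field:
        raise ValueError("empty date field")
    return ("20" if field[0] == "0" else "19") + field


def add_century_to_number(value: int, n_digits: int) -> int:
    """Numeric rule from step (b): compare the value to 10**(n-1) and add
    20*10**n or 19*10**n, where n is the width of the original field."""
    if n_digits not in (2, 4, 6):
        raise ValueError("original field must have 2, 4, or 6 digits")
    if not 0 <= value < 10 ** n_digits:
        raise ValueError("value does not fit the stated field width")
    threshold = 10 ** (n_digits - 1)
    century = 20 if value < threshold else 19
    return century * 10 ** n_digits + value


# Index points and display: YY, YYMM, and YYMMDD character fields.
year_text = add_century_to_text("98")            # '1998'
month_text = add_century_to_text("9805")         # '199805'
day_text = add_century_to_text("010529")         # '20010529'

# Ranging values: the same dates held as numbers.
year_num = add_century_to_number(98, 2)          # 1998
month_num = add_century_to_number(105, 4)        # 200105, from '0105'
day_num = add_century_to_number(980529, 6)       # 19980529
```

Both rules amount to the same test: a leading zero in the original field corresponds to a value below 10**(n-1). That is why the numeric rule needs the original field width from the specifications, since a stored value such as 105 carries no leading zero of its own.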

The requisite programming was written and tested in the Spring of 1997. As the NLM replaces just about all of its databases starting in the Summer in a process known as Year-End Processing (YEP), it was decided to implement the change during the rebuilding process. All updated and now Year 2000 compliant databases were replaced in mid-December 1997, along with the IRS presentation changes, and the system has been running successfully since that time without error.

As the data distributed to our tape recipients were unchanged, the above described algorithm was made available to them in the Fall of 1997 before the data were distributed in late December.

--prepared by David Kenton, Database Administrator

Office of Computer and Communications Systems

U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health, Department of Health & Human Services
Last updated: 13 February 2004