Specs for NLM Digital Repository Objects
Two-Dimensional Materials (Texts, Manuscripts, Graphics)
NLM currently digitizes printed monographs in-house using two Book2net Cobra V-Scan image capture systems (image area approx. 18" X 25" per page) and a Zeutschel 14000-A large format scanner (image area approx. 24" X 38"). The Zeutschel scanner is used primarily for fold-outs and large flat paper objects. Initial monograph digitization was done in-house on a Kirtas KABIS III scanner. Some additional content was scanned offsite by vendors. Digitized texts contain the following components:
Monographs and Serials
Per Book
- OCR - composite text file
- Full color PDF
- Descriptive metadata files (see Descriptive Metadata below)
- METS - an XML document produced by the scanner’s software which encodes page image sequencing as well as technical details of the scanning operation
- Preview image in JPG format
Per Page
- Currently, master images are in uncompressed TIFF format, 400 DPI, 24-bit color, with some key technical metadata embedded. Images captured with the Kirtas KABIS III scanner have JPEG masters.
- Access derivative image in JPEG2000 format
- OCR - page text file
- ALTO XML file of text with structural markup
- MIX - an XML schema for encoding the structure of digital still images; this file is produced per page by NLM’s book scanner and embedded in the book-level METS file
- Thumbnail in JPG format
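The per-page component list above can be read as a completeness checklist. The following is a minimal sketch of such a check, not NLM's actual tooling. The file suffixes and the `missing_page_components` helper are assumptions made for illustration, since the naming scheme is not stated here. MIX is left out because it is embedded in the book-level METS file rather than stored per page.

```python
# Hypothetical suffixes for each per-page component; the real naming
# scheme is not documented here. Pages captured on the Kirtas KABIS III
# scanner would have JPEG masters instead of TIFF.
PER_PAGE_SUFFIXES = {
    "master": ".tif",       # uncompressed TIFF master
    "access": ".jp2",       # JPEG2000 access derivative
    "ocr_text": ".txt",     # OCR page text file
    "alto": ".alto.xml",    # ALTO XML with structural markup
    "thumbnail": ".jpg",    # thumbnail image
}


def missing_page_components(page_id, filenames):
    """Return the sorted names of components with no matching file.

    A component counts as present when a file named page_id + suffix
    appears in filenames.
    """
    present = set(filenames)
    return sorted(
        name
        for name, suffix in PER_PAGE_SUFFIXES.items()
        if page_id + suffix not in present
    )
```

Calling `missing_page_components("p0001", ["p0001.tif", "p0001.jp2", "p0001.txt"])` would report the ALTO file and the thumbnail as missing.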
Other Formats
Digitized manuscript materials, graphical prints of ink on paper, photographic prints, maps, and photographic films are typically produced with TIFF masters and JPEG access derivatives. Scanning resolutions are generally higher for these materials, up to 600 DPI. Most of these materials do not generate text files.
Moving Images (Films and Videotapes)
Film and video materials are digitized to MPEG2 from BetacamSP or DVD copies. The BetacamSP preservation copies are produced by offsite vendors.
- MPEG2 digital master is a full-resolution, 640x480 NTSC video, with audio as in the original
- Access derivatives are created in the MP4 format with H.264 compression and AAC audio, in two different compression and resolution pairings, optimized for particular display devices
- Descriptive metadata files (see Descriptive Metadata below)
- Transcript (text file)
- Time-coded captions in DFXP and SRT formats
- Preview image in JPG format
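As an illustration of the access derivative step above, the sketch below builds an ffmpeg command line that transcodes a master to MP4 with H.264 video and AAC audio. It assumes ffmpeg as the transcoder, which this page does not state, and the `mp4_derivative_command` helper and the width, height, and quality values passed to it are hypothetical. The actual compression and resolution pairings used for Digital Collections are not given here.

```python
def mp4_derivative_command(master, output, width, height, crf):
    """Build an ffmpeg argument list for one MP4 access derivative.

    The list is only constructed, not executed. Pass it to
    subprocess.run() to perform the transcode.
    """
    return [
        "ffmpeg",
        "-i", master,                      # input: MPEG2 digital master
        "-c:v", "libx264",                 # H.264 video encoder
        "-crf", str(crf),                  # quality level (lower is higher quality)
        "-vf", f"scale={width}:{height}",  # output resolution
        "-c:a", "aac",                     # AAC audio encoder
        output,
    ]
```

Two calls with different `width`, `height`, and `crf` values would yield the two pairings described in the list, for example one for larger displays and one for mobile devices.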
Audio
- Master recording digitized to .wav format
- Access derivatives are created in .mp3 format
- Transcript
Descriptive Metadata
- Typically, each resource in Digital Collections has a corresponding MARC bibliographic record in NLM's LocatorPlus Catalog.
- For each digitized resource, three descriptive metadata files are generated:
  - MARCXML - the base metadata used to generate other metadata in Digital Collections; also supplied to Internet Archive for NLM digital resources made available on that site
  - OAI-compliant Dublin Core - for public consumption
  - DMDINDEX - an internal custom scheme that drives indexing and UI display
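Because MARCXML is the base from which the other metadata is generated, a small example may help show what such a derivation involves. The sketch below pulls a Dublin Core style title from MARC field 245 (subfields a and b) using only the Python standard library. It is an assumption-laden illustration, not NLM's actual crosswalk, and the `dc_title_from_marcxml` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard MARCXML namespace published by the Library of Congress.
MARC_NS = {"marc": "http://www.loc.gov/MARC21/slim"}


def dc_title_from_marcxml(marcxml):
    """Return a title string built from field 245 $a and $b, or None."""
    root = ET.fromstring(marcxml)
    field = root.find(".//marc:datafield[@tag='245']", MARC_NS)
    if field is None:
        return None
    # Keep title proper ($a) and remainder of title ($b) in record order.
    parts = [
        sf.text.strip()
        for sf in field.findall("marc:subfield", MARC_NS)
        if sf.get("code") in ("a", "b") and sf.text
    ]
    return " ".join(parts) or None
```

A fuller crosswalk would also map creators, subjects, dates, and identifiers, and would normally strip trailing ISBD punctuation such as " /" from the title.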
Last Reviewed: June 2, 2021