NLM’s Digitization Specifications


Two-Dimensional Materials (Texts, Manuscripts, Graphics) 

The majority of monographs were scanned in-house using a Kirtas KABIS III scanner.  Additional content was scanned offsite by vendors. Digitized texts contain the following components:

Monographs and Serials

Per Book

  • OCR - composite text file
  • Full color PDF
  • Descriptive metadata files (see Descriptive Metadata below)
  • METS - an XML document produced by the scanner’s software, which encodes page image sequencing as well as technical details of the scanning operation
  • Preview image in JPG format
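
As an illustration of how the page sequencing in a METS file can be read back, the following is a minimal sketch in Python. The sample document, its file names, and the function name page_sequence are hypothetical; the METS produced by the scanner software may be organized differently, so treat this as a sketch under those assumptions rather than a description of NLM's files.

```python
import xml.etree.ElementTree as ET

# Standard METS and XLink namespaces.
NS = {
    "mets": "http://www.loc.gov/METS/",
    "xlink": "http://www.w3.org/1999/xlink",
}

# Hypothetical, heavily trimmed METS document (an assumption for illustration).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="master">
      <mets:file ID="IMG2"><mets:FLocat LOCTYPE="URL" xlink:href="0002.tif"/></mets:file>
      <mets:file ID="IMG1"><mets:FLocat LOCTYPE="URL" xlink:href="0001.tif"/></mets:file>
    </mets:fileGrp>
  </mets:fileSec>
  <mets:structMap TYPE="physical">
    <mets:div TYPE="book">
      <mets:div TYPE="page" ORDER="2"><mets:fptr FILEID="IMG2"/></mets:div>
      <mets:div TYPE="page" ORDER="1"><mets:fptr FILEID="IMG1"/></mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def page_sequence(mets_xml):
    """Return image file names in the page order given by the structMap."""
    root = ET.fromstring(mets_xml)
    href_attr = "{%s}href" % NS["xlink"]

    # Map each file ID to the location recorded in its FLocat element.
    files = {}
    for file_el in root.iterfind(".//mets:file", NS):
        flocat = file_el.find("mets:FLocat", NS)
        if flocat is not None:
            files[file_el.get("ID")] = flocat.get(href_attr)

    # Collect (ORDER, file name) pairs from the page-level divs.
    pages = []
    for div in root.iterfind(".//mets:div[@TYPE='page']", NS):
        fptr = div.find("mets:fptr", NS)
        if fptr is None:
            continue
        pages.append((int(div.get("ORDER")), files.get(fptr.get("FILEID"))))

    return [name for _, name in sorted(pages, key=lambda p: p[0])]


print(page_sequence(SAMPLE_METS))
```

Sorting on the ORDER attribute, rather than relying on document order, keeps the sequence correct even when the page entries are not stored in reading order.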

Per Page

  • Master image in either JPEG or TIFF format, 400 DPI, 24-bit color, with some key technical metadata embedded
  • Access derivative image in JPEG2000 format
  • OCR - page text file
  • ALTO XML file of text with structural markup
  • MIX - an XML schema for encoding the structure of digital still images; this file is produced per page by NLM’s book scanner and embedded in the book-level METS file
  • Thumbnail in JPG format

Other Formats

Digitized manuscript materials, graphical prints of ink on paper, photographic prints, and photographic films are typically produced with TIFF masters and JPEG access derivatives.  Scanning resolutions are generally higher than for monographs, up to 600 DPI. Most of these materials do not generate text files.
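
To give a sense of what the step from 400 DPI to 600 DPI means in practice, the following sketch computes pixel dimensions and uncompressed image size. The 6 x 9 inch page size and the function name uncompressed_bytes are assumptions chosen for illustration, and 24-bit color is assumed for both resolutions; actual file sizes depend on the material and on any compression applied.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel=24):
    """Uncompressed raster size in bytes for a page scanned at a given DPI."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumed page size of 6 x 9 inches (hypothetical example).
at_400 = uncompressed_bytes(6, 9, 400)  # 2400 x 3600 pixels
at_600 = uncompressed_bytes(6, 9, 600)  # 3600 x 5400 pixels

print(at_400)           # 25920000
print(at_600)           # 58320000
print(at_600 / at_400)  # 2.25
```

Because pixel count grows with the square of the resolution, a 1.5x increase in DPI yields 2.25x as much uncompressed data for the same page.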

Moving Images (Films and Videotapes)

Film and video materials are digitized to MPEG2 from BetacamSP or DVD copies.  The BetacamSP preservation copies are produced by offsite vendors.

  • MPEG2 digital master is a full resolution, 640x480 NTSC video, with audio as in the original
  • Access derivatives are created in-house using popular formats:
    • .mp4 with H.264 compression and AAC audio, in three different compression and resolution pairings, optimized for particular display devices
    • .mov in high quality, full resolution
    • .wmv in high quality, full resolution
  • Transcript (text file)
  • Time-coded captions in DFXP, QuickTime and MAGpie formats
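
The .mp4 derivatives described above could be produced in several ways. The sketch below builds ffmpeg command lines for three compression and resolution pairings. The use of ffmpeg, the profile names, the resolutions, and the bitrates are all assumptions for illustration; the page does not state which tool or settings are used in-house.

```python
# Hypothetical profiles: (width, height, video bitrate). Not NLM's settings.
PROFILES = {
    "small": (320, 240, "400k"),
    "medium": (480, 360, "800k"),
    "large": (640, 480, "1500k"),
}


def mp4_command(master, profile):
    """Build an ffmpeg argument list for one H.264/AAC .mp4 derivative."""
    width, height, video_bitrate = PROFILES[profile]
    stem = master.rsplit(".", 1)[0]
    return [
        "ffmpeg",
        "-i", master,                       # MPEG2 digital master
        "-c:v", "libx264",                  # H.264 video
        "-b:v", video_bitrate,
        "-vf", f"scale={width}:{height}",   # resize to the target resolution
        "-c:a", "aac",                      # AAC audio
        "-b:a", "128k",
        f"{stem}_{profile}.mp4",
    ]


for name in PROFILES:
    print(" ".join(mp4_command("film.mpg", name)))
```

The function only assembles the arguments; passing the list to subprocess.run would perform the actual transcode on a system where ffmpeg is installed.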


Audio Recordings

  • Master recording digitized to .wav format
  • Access derivatives are created in .mp3 format
  • Transcript

Descriptive Metadata

  • Typically each resource in Digital Collections has a corresponding bibliographic record in NLM’s Integrated Library System (ILS).
  • Some manuscript collections have item-level metadata as well, usually in Dublin Core.
  • From each bibliographic record (in MARCXML format), a Dublin Core record and a custom metadata record, “DMDINDEX”, are derived via XSLT.
  • Both the Dublin Core and DMDINDEX records are used for display and indexing in Digital Collections.
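
The derivation of a Dublin Core record from MARCXML can be pictured with a small crosswalk. The sketch below uses Python's ElementTree as a stand-in for XSLT, so it is not the stylesheet NLM uses. The sample record, the function name marc_to_dc, and the four-field mapping are assumptions for illustration, following the common convention of mapping 245$a to title, 100$a to creator, 260$b to publisher, and 260$c to date. The custom DMDINDEX record is not sketched, since its structure is not described here.

```python
import xml.etree.ElementTree as ET

MARC = "http://www.loc.gov/MARC21/slim"
DC = "http://purl.org/dc/elements/1.1/"

# Assumed minimal crosswalk: (MARC tag, subfield code, Dublin Core element).
CROSSWALK = [
    ("245", "a", "title"),
    ("100", "a", "creator"),
    ("260", "b", "publisher"),
    ("260", "c", "date"),
]

# Hypothetical MARCXML record used only for this example.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Doe, Jane.</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example treatise</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="b">Example Press,</subfield>
    <subfield code="c">1850.</subfield>
  </datafield>
</record>"""


def marc_to_dc(marcxml):
    """Derive a simple Dublin Core element tree from one MARCXML record."""
    record = ET.fromstring(marcxml)
    ns = {"m": MARC}
    ET.register_namespace("dc", DC)  # serialize with the usual dc: prefix
    dc_root = ET.Element("record")
    for tag, code, element in CROSSWALK:
        path = f"m:datafield[@tag='{tag}']/m:subfield[@code='{code}']"
        for sub in record.iterfind(path, ns):
            child = ET.SubElement(dc_root, f"{{{DC}}}{element}")
            child.text = (sub.text or "").strip()
    return dc_root


dc_record = marc_to_dc(SAMPLE)
print(ET.tostring(dc_record, encoding="unicode"))
```

This sketch copies subfield text as-is, including the trailing punctuation that MARC cataloging conventions add; a production stylesheet would usually clean that up and cover many more fields.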