NLM is Moving its Sequence Read Archive (SRA) to the Cloud
April 10, 2019
The National Library of Medicine (NLM) has begun moving biomedical sequence data from the SRA onto Google Cloud and Amazon Web Services cloud platforms.
SRA is the world’s largest publicly available repository of high-throughput sequencing data. The database supports discoveries in genomic variation, gene expression, functional genomics, bioinformatics methods development, detection of novel genes, and microbial species.
This migration to the cloud is part of NIH’s Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability (STRIDES) initiative.
“The STRIDES initiative demonstrates what a 21st century library can be—making data and information important to biomedical science findable, accessible, interoperable and reusable,” said NLM Director Patricia Flatley Brennan, R.N., Ph.D.
Currently, the SRA database contains more than 26 petabytes (or 26 quadrillion bytes) of sequence data and is growing at an exponential rate. SRA includes data that is open to the public as well as some types of human data that require approval to access. Due to SRA’s extraordinarily large size and the nature of the data, movement to the cloud will take place in stages. Publicly available data will be migrated by the end of July 2019, followed by authorized-access data. These data sets will be in a standardized compressed format. Versions of the data in the original format will then migrate, with a targeted completion date of 2021. During the migration, SRA will continue to operate as usual.
Currently, scientists who want to compare sequences typically download a subset and perform their computations locally.
“Moving SRA to the cloud means that for the first time, scientists will be able to explore all of the sequence data there, providing a tremendous opportunity for research and discovery,” said Jim Ostell, Ph.D., director of the National Center for Biotechnology Information at NLM. “Scientists will no longer be limited by the size of their local storage or their computing capacity, and they will be able to take advantage of tools that others develop for use on the cloud.”
The National Library of Medicine (NLM) is a leader in research in biomedical informatics and data science and the world’s largest biomedical library. NLM conducts and supports research in methods for recording, storing, retrieving, preserving, and communicating health information. NLM creates resources and tools that are used billions of times each year by millions of people to access and analyze molecular biology, biotechnology, toxicology, environmental health, and health services information. Additional information is available at https://www.nlm.nih.gov.
Last Reviewed: April 10, 2019