00:02
[Peter Cooper] Okay, so I'm going to go ahead and turn this
00:04
over to Eric and he's going to take you through datasets and online tools. Take it away, Eric.
00:09
[Eric Cox] Thank you,
00:10
Peter. Thanks, everybody, for coming today. Welcome to the webinar on using
00:16
datasets commandline tools to download and use genome metadata.
00:20
So here is an outline of the webinar today. First I'm going to introduce the datasets project
00:27
and our command-line tools . Then I'm going to go over two examples of using our command-line tools.
00:35
So first I'm going to show you one way to get high quality butterfly genomes,
00:41
and then I'm going to show you how to find SARS-CoV-2 lambda variant genomes.
00:47
So the butterfly genome example will hopefully give you some general ideas
00:51
about how you can get genome metadata and the sequence data for really any organism,
00:57
while the SARS-CoV-2 example is a little more specifically focused on SARS-CoV-2 data.
01:06
So datasets is a relatively new effort at NCBI
01:10
that's focused on making sequence data easy to find and download.
01:15
We have web pages including our home page on here,
01:20
command-line tools, that I'm going to talk about today, and documented public rest API.
01:27
A big emphasis of the project is providing easier access to genome data,
01:32
including genome sequence, genome annotation, and transcript and protein sequences.
01:41
So today's presentation is just focused on accessing genome data using our command-line
01:47
tools. So as I said, we are looking at two examples of using our command-line tools . In
01:54
both examples we'll use the datasets tool. This is for downloading biological sequence data,
02:01
and in the second example we'll also talk about dataformat, which is used to generate
02:06
human readable metadata tables. We don't have time today to go over installing the tools, but you can
02:15
download them from the datasets website, and we have recently made them available through conda.
02:25
We are also going to use two other third party tools, mostly jq and we'll also touch a little bit
02:33
on samtools. jq is a great tool for working with JSON formatted data.
02:41
The datasets command-line tool delivers metadata in two flavors of JSON, JSON and JSON lines.
02:49
So just a quick note about versions. Today we're talking about version 12 of the datasets
02:56
command-line tools. These are expected to remain stable for the foreseeable future,
03:03
and these are based on version 1 of our API. And at the moment we are starting development on API
03:12
version 2, which will power future releases of the command-line tools,
03:17
and we are planning a lot of new features and some breaking changes. This is a great
03:22
time to give us feedback on where we'll be going next as we start planning version 2 of the API.
03:31
So for the first example, we are going to download metadata describing butterfly genomes,
03:38
then use information in this metadata to help us identify and download a subset of high quality
03:44
genomes. We are going to use a built-in filter to get butterfly genomes at the chromosome
03:51
assembly level . Then we're going to further restrict this set of chromosome level butterfly
03:57
genome assemblies to those above a certain contig N50 value. Along the way we'll show
04:05
you how to get genome metadata, how to convert genome metadata into human readable formats,
04:11
how you can filter the metadata, and finally, how to download a set of genome assemblies.
04:16
We are going to start by downloading genome metadata for all butterflies,
04:20
so we are going to go ahead and switch to the terminal window now.
04:26
Okay. So one way to get genome metadata is by using the datasets summary command.
04:34
So I'm going to go through each part of the command and describe what it means. So datasets
04:40
summary means that we are going to get metadata, genome specifies the type of metadata, and taxon
04:49
tells you what kind of a query you are going to use to get this genome metadata. So this can be a
04:55
taxonomic name or NCBI taxonomic I.D. In this case we are just going to use the name butterflies.
05:04
The assembly source parameter allows you to specify whether you want data for either GenBank
05:10
or RefSeq assemblies. Today we're just going to look at GenBank assemblies, keep things simple.
05:17
And we are going to run this command and save the output to a file, butterfly_summary.json.
05:25
So let's start with the simple question about this metadata. How many genomes are there?
05:34
So here's one way to get the genome count. We are going to use jq to pretty print the data to make
05:42
it easier to read and then we are going to pipe this to tail to show the last 10 lines of this
05:49
pretty printed data. Because the genome count is always shown at the end of the genome metadata
05:56
that you get using a summary command, this command will let you see the number of butterfly genomes
06:01
described by the metadata. So you can see here it just shows total count, 562, so that means there's
06:09
562 genomes. Another way to get the genome count is to specify the name of this total count field,
06:16
and using the power of jq, just pull the value out directly.
06:22
So here we just run jq and specify the field name, total count, and the file name. And you'll just
06:30
get that genome count back directly. So since we are looking for high quality butterfly genomes,
06:38
we can use the assembly level filter to only get chromosome level butterfly genome assemblies.
06:47
So we are going to run this summary command again, but this time we'll use two filters.
06:53
One filter to get genbank assemblies and one filter to get chromosome level assemblies.
07:03
Okay. So we've saved that metadata to a file, butterfly chrome level.json.
07:09
Now we are going to use jq, specifying the field name and total count to see how many
07:17
genomes -- how many butterfly genomes are at the chromosome assembly level?
07:23
So you can see that number is a lot smaller than what we had before. So we've reduced the set of
07:28
genomes from almost 600 genomes to less than 50. So it's great when you need to filter by
07:35
some attributes where we already have a built-in filter, like assembly level or assembly source.
07:41
You also have filters to only get annotated genomes or get genomes released before or
07:46
after a certain date. But sometimes you may need to filter by other attributes of a genome record
07:52
where we don't have built in filters available. So I'll show you how to do that using jq.
07:59
People often use contig N50 value as a measure of genome quality,
08:05
so to further narrow down our set of butterfly genomes, we are going to
08:09
take the list of chromosome level butterfly genome assemblies we already have, and only
08:15
keep the genomes where the contig N50 value is over 1.5 million.
08:21
So I'm going to show you how to make a simple table of assembly accessions
08:26
and contig N50 values using jq.
08:34
So basically what this command is saying is take this metadata describing the chromosome level
08:41
butterfly assemblies, send it to jq, and then for each assembly, give me back the assembly accession
08:53
and the contig N50 value. And then generate a tab delimited table in CSV format and then
09:04
I'm going to pipe that to head to just show the first five rows of that table.
09:10
So now you have this two column table. The first column has
09:14
assembly accessions, the second column shows you contig N50 values.
09:19
So next we're going to get the genome accessions for those
09:22
genomes with a contig N50 value over 1.5 million.
09:31
so again, we are sending the metadata for all chromosome level butterfly
09:36
genome assemblies to jq . Then this jq command for all assemblies in the set,
09:43
just give me back the assemblies with a contig N50 value
09:48
over 1.5 million, and then for each of those assemblies, just have me the assembly accession.
09:58
So this is a whole list of butterfly genome assemblies that are both
10:04
chromosome level assemblies with contig N50 values over 1.5 million.
10:11
And in case you're curious where these genomes came from, these are all from the Darwin tree
10:15
of life project, which is a project to try and sequence 24,000 eukaryotes from
10:20
Britain and Ireland. Next we're going to save this list to a file.
10:28
So I'm just going to run that same command again and save it to a file called accession list.
10:36
Now that we have this list of assembly accessions,
10:39
we can download the genome sequences for these genomes. So this takes a few minutes,
10:44
so I've already downloaded it in advance, but I'm going to show you the command that I used.
10:52
So this is what the download command looks like. I'm going to go through each part of the command,
10:58
explain what it means. This says download genome data given one or more accessions.
11:06
datasets download genome accession
11:06
In this case we are using a text file with a list of accessions we generated
11:11
previously to specify which genomes we want. So then why do you need all these exclude flags?
11:20
So datasets delivers our sequence and annotation data in a zip archive that we call a data package.
11:27
By default, these packages will include a bunch of different types of sequence data,
11:32
annotation data, and metadata.
11:35
So for example, for an annotated eukaryote genome,
11:39
we would include the genome sequence, transcript sequences,
11:44
annotation in GFF three format and metadata in one or more files we call data reports. So if
11:52
you only want to get genome sequence, you can add various exclude flags to, for example, exclude the
12:00
protein sequence, exclude the annotation file, and exclude the RNA or transcript sequences.
12:08
So let's take a quick look at what's in this downloaded package of genome sequence.
12:14
I'm going to use the tree command to show you
12:18
the directory structure and some of the files that are included in the download package.
12:23
So as I said, I've already downloaded this zip archive and extracted it on my computer.
12:31
We are going to run tree on this directory that contains the contents of that zip archive and then
12:38
pipe that to less so we can scroll through that output.
12:44
So you'll notice that data for each genome assembly is in a directory that's named with
12:51
that assembly accession. So in this case we asked for just genome sequence, so you'll see
12:58
fasta files in each of these directories representing each assembled chromosome,
13:04
sequences from unplaced and unlocalized sequences are in separate files.
13:10
And in each directory representing a genome assembly, there's a sequence
13:14
report file that's basically a list of all those sequences that comprise the assembly.
13:19
And if you scroll all the way down to the bottom,
13:25
you'll see the assembly data report, which describes all the genome assemblies
13:32
that are included in this package. So that's the end of the first example.
13:40
Okay. So next we're going to talk about SARS-CoV-2 genomes. So this example is focused on SARS-CoV-2
13:49
genome data. We'll show you how to download genome data then filter the associated metadata.
13:56
Specifically we'll show you how to pull out SARS-CoV-2 genome metadata
14:00
and sequence data for one SARS-CoV-2 lineage, the lambda variant of interest.
14:06
So first I'll show you how to download data for all SARS-CoV-2 genomes.
14:22
So this is the command you'd use to download all SARS-CoV-2 genomes.
14:28
datasets download virus genome taxon sars-cov-2
14:29
We are now up to over 1.5 million genomes. So this download can take a little while depending on your
14:36
Internet connection. For the purpose of today's demo, we are going to use a smaller package of
14:42
SARS-CoV-2 genomes that were released during a three day period last week which contains just
14:48
about 65,000 genomes. So I've already downloaded the zip file and extracted the zip archive.
14:56
So we are going to use tree again to take a look at what's inside what I've downloaded.
15:06
So you'll see there's just a few files in the data directory. Compared to that genome data
15:15
package that we downloaded in the last example. I just want you to pay attention to two files here.
15:21
There is the data report that includes metadata and annotation data for each viral genome,
15:29
and genomic.fna includes all 65,000 genome sequences. So now I'm going to
15:38
introduce the other command line tool that we've developed. It's called dataformat.
15:42
This is specifically for generating tables based on the metadata in this data report file here.
15:53
So we'll go through each part of this command and what it means.
15:56
dataformat tsv virus-genome --inputfile ...data_report.jsonl
15:56
--fields isolate-lineage, accession, geo-location, isolate-collection-date, virus-pangolin
15:56
So dataformat is the name of the tool. TSV means we want the output to be a tab delimited table,
16:04
and virus genome specifies the type of data report we'll be using as input.
16:11
We'll use the input file parameter to specify the path to the data report itself,
16:19
and then we'll use fields to specify all the columns that we want to include in our table.
16:27
So here we are including isolate-lineage, which is an identifier for the genome record. Accession
16:34
is the NCBI nucleotide accession. Geo-location is the geographic location
16:41
where the virus was isolated. Isolate-collection-date is when it was found,
16:48
and virus pangolin is the pango lineage name that we've calculated for that genome.
16:56
Then we are going to use head to show the first five rows of that table.
17:05
So you can see there's four genomes here, a little bit hard to read. The last column is kind of going
17:14
over the next row, but if you've looked at these pango lineages before, you might recognize this,
17:20
B.1.617.2 as the so-called delta variant. Next we're going to pull out metadata for
17:30
all SARS-CoV-2 genomes, representing just the lambda variant of interest.
17:36
So one thing you should know about our data reports is that each line of the data report
17:42
represents a single record, where in this case a single SARS-CoV-2 genome.
17:48
So this allows you to use grep with an identifier or other string to pull out
17:53
all the metadata for a genome containing that identifier.
18:03
So in this case we want to get all the metadata for the lambda variant of interest genomes.
18:09
We don't have the W.H.O. term, lambda, in our data reports,
18:13
but we do have the pango lineage names, and for lambda that's C 37. So we'll grep for "C\.37"
18:20
to pull out the metadata for those lambda variant genomes. And then we'll save that to a file,
18:30
lambda.jsonl. So now let's use -- it's taking a little bit longer than expected. Here we go. All
18:39
right. Now we are going to use dataformat and take a look at the metadata we saved.
18:44
dataformat tsv virus-genome --inputfile lambda.jsonl
18:45
--fields isolate-lineage,accession,
18:46
geo-location,isolate-collection-date,virus-pangolin
18:47
So we are going to basically run the same dataformat command that we just ran, but
18:54
now we are going to use this new file that we've generated as the input file. So let's see what
19:01
that looks like. All right. You can see that -- if you look at this last column you can see now we
19:08
only have the lambda variant genomes or the C.37 genomes. Next I'm going to show you how we can use
19:17
dataformat and samtools to pull out the genome sequences for these lambda variant genomes.
19:24
So first we'll use dataformat to get the accessions only from our
19:28
metadata file. Then we'll use that list of accessions to specify which genome sequences
19:33
to pull out of our genome sequence file. So first we'll use dataformat again.
19:40
dataformat tsv virus-genome --inputfile lambda.jsonl
19:42
--fields accession -elide-header
19:43
This time we'll specify just a single field, accession, to get a single column table of
19:49
accessions only. And we are also going to use the elide-header flag to remove the column name,
19:57
which could cause problems with samtools. So I'll just show you what the first five accessions looks
20:05
like. Okay, now we are going to go ahead and save that list as a file, lambda_accessions.txt.
20:17
Now we are going to use samtools to extract the lambda sequences
20:21
from the original genomic file that we downloaded as part of the package.
20:26
samtools faidx --region-file lambda_accessions.txt
20:29
--output lambda_genomes.fna .../genomic.fna
20:31
So here is the samtools command we are using.
20:36
We're going to specify the file with the accession list. This is the original genomic sequence file,
20:47
and then we are going to save the out put as lambda genomes.FNA. All right. And then let's see
20:58
what you can do with that file. Okay. We are going to just drag and drop that sequence file into the
21:10
Nextclade website. And give it a few seconds. This is kind of a nice way to visualize these genomes.
21:28
And it's nice to see that they've also identified as lambda variant genomes,
21:35
and this kind of gives you a nice visualization of other mutations that are present in genomes, and
21:40
some ideas about genome quality.
21:45
And that concludes the second example. So I think we can take some questions.
21:51
[Peter Cooper] Hi Eric, thanks. Great. There's a few questions.
21:58
One, there was a little bit of discussion about installing this on a cluster, and I
22:06
wanted Brad to comment on that, because he had a couple thoughts about that. Brad, are you there?
22:12
[Brad Holmes] I'm happy to speak to that. So I'll be honest,
22:17
we have a different kind of cluster here, and in our case we'd be responsible for either,
22:22
every time we submit a job, ensuring that the app gets installed for conda or through some
22:28
direct download link, the quick starts link that's also in the answer. So that's one option, or you
22:36
could talk to systems and say we want this sort of globally installed just like you install any other
22:40
package, and they could follow the similar steps to ensure it's available on every machine. So I
22:46
think the answers kind of sort of very on what the administrators want to do with the cluster.
22:50
[Peter Cooper] Okay, and you had some comments about Rake, too.
22:56
[Brad Holmes] That's the other thing,
23:01
these tools are rate limited. It's possible to get an API key. We are working on setting up API
23:08
keys through My NCBI, which would allow a slightly higher rate,
23:12
but we do not ever intend to support 400 concurrent connections from the same API key,
23:19
all downloading a package. So it would be bad practice to say all right, well, I want to start
23:24
400 jobs and they all go download 1 million sequences, the same million sequences and they
23:30
sort of in parallel operate on that package. It would be better practice to have a single task,
23:36
to first go get that package of data and then send that data out to the 400 jobs or
23:43
however you share data on your cluster, and they can then operate on the same package. So if we do
23:50
see ... service is going to degrade, if there is that many concurrent connections coming in.
23:56
[Peter Cooper] Okay. Thanks, Brad. And another
24:00
question that was here is about downloading by accession dot version, and Brad chatted
24:05
to me that he didn't think it was possible, so Eric, do you have any comments on that?
24:11
In other words, can you just specify the accession dot version -- I think this is what the person
24:14
wanted. Can you give it an accession.version and use datasets to download something that way?
24:22
[Eric Cox] You can, only with assembly accessions,
24:26
genome assembly accessions, but you can't do that yet for SARS-CoV-2 genomes.
24:32
[Brad Holmes]
24:33
You can also download gene data by RefSeq accession.versions. But a question came
24:41
in while we were looking at viruses, so in terms of the virus genomic RefSeq accession,
24:46
there is not a way to access that data. If that's useful, we'd like to know.
24:50
[Peter Cooper]
24:52
Okay, and there was another -- here's a couple more that have come in. But one of them had to
24:58
do with how quickly sequences that are submitted, I think that's what the question is really about,
25:03
how quickly they are turned around. So we are talking about SARS-CoV-2. This may be a question
25:08
for a group at NCBI. If it is, I'll get the answer from them, too. But how quickly does a sequence
25:15
get from being submitted to being available on datasets for the SARS-CoV-2 sequences?
25:21
[Nuala O'leary] I can answer that.
25:23
So when sequences gets submitted to genbank, they're going to get picked up by NCBI Virus,
25:29
if you're aware of that resource, and we pull them from NCBI Virus.
25:35
Brad may be able to comment on exactly how long it takes, but we do get our virus data set every day,
25:43
so we should pick up anything that's new in NCBI virus. And I believe that also gets updated every
25:48
day, or a couple times a day. And Brad, if you have any more information on that --
25:53
[Brad Holmes] I don't. It's so many factors involved,
25:56
how quickly the labs turn over the data, how quickly it goes through our submission system,
26:00
how many other submissions we received the same day. We are working to get
26:04
out as fast as possible, 24 hours is certainly possible. But there are no guarantees on that.
26:10
[Peter Cooper] And somebody asked a
26:13
question that's probably worth addressing. You mentioned there are 1.5 million
26:18
genomes for SARS-CoV-2 and the question is, are they all unique or are they
26:23
duplicate sequences isolated from different individuals? And Brad sort of answered that.
26:28
[Nuala O'leary]
26:30
They are unique submissions, but the sequences can be identical.
26:33
[Peter Cooper] Okay, and there was another
26:39
question that -- just a general one about how to find out about webinars and workshops like this.
26:46
And Eric, did you have that slide at the end that had the sort of
26:50
links to the social media and things like that? You kept that in your slide?
26:53
[Eric Cox] Yes, this one?
26:56
[Peter Cooper] ust to point out to you that
27:00
we have social media presence and we announce all these things on social media,
27:06
on Twitter, on Facebook, and on LinkedIn. We also have a blog, NCBI Insights, so if
27:13
you go to that blog and and you have a WordPress account you can subscribe to that blog as well.
27:20
There is also an old-fashioned email list that somebody asking about, and that's
27:27
the NCBI announce email list, so if that's something you want, I could put that URL into
27:34
the chat to everybody and that's how you get to it if you want to subscribe to that email list.
27:39
ncbi-announce@ncbi.nlm.nih.gov
27:40
Okay, and I think we covered all the questions that you have in the chat right now. If there's
27:46
any that we didn't get to answer, I'll make sure that they're in the document.
27:52
Thanks, everybody, for coming. Thanks, Brad, Nuala, and Eric for being here.
27:58
[Eric Cox] Thank you.
27:59
[Peter Cooper] Okay, thanks a lot.