00:02
[Peter Cooper] Okay, so I'm going to go ahead and turn this  
00:04
over to Eric and he's going to take you through  datasets and online tools. Take it away, Eric.
00:09
[Eric Cox] Thank you,  
00:10
Peter. Thanks, everybody, for coming  today. Welcome to the webinar on using  
00:16
datasets commandline tools to  download and use genome metadata.  
00:20
So here is an outline of the webinar today.  First I'm going to introduce the datasets project  
00:27
and our command-line tools . Then I'm going to go  over two examples of using our command-line tools.  
00:35
So first I'm going to show you one way  to get high quality butterfly genomes,  
00:41
and then I'm going to show you how to  find SARS-CoV-2 lambda variant genomes.  
00:47
So the butterfly genome example will  hopefully give you some general ideas  
00:51
about how you can get genome metadata and  the sequence data for really any organism,  
00:57
while the SARS-CoV-2 example is a little  more specifically focused on SARS-CoV-2 data.  
01:06
So datasets is a relatively new effort at NCBI  
01:10
that's focused on making sequence  data easy to find and download.
01:15
We have web pages including our home page on here,  
01:20
command-line tools, that I'm going to talk  about today, and documented public rest API.  
01:27
A big emphasis of the project is  providing easier access to genome data,  
01:32
including genome sequence, genome annotation,  and transcript and protein sequences.  
01:41
So today's presentation is just focused on  accessing genome data using our command-line  
01:47
tools. So as I said, we are looking at two  examples of using our command-line tools . In  
01:54
both examples we'll use the datasets tool. This  is for downloading biological sequence data,  
02:01
and in the second example we'll also talk  about dataformat, which is used to generate  
02:06
human readable metadata tables. We don't have time  today to go over installing the tools, but you can  
02:15
download them from the datasets website, and we  have recently made them available through conda.  
02:25
We are also going to use two other third party  tools, mostly jq and we'll also touch a little bit  
02:33
on samtools. jq is a great tool for  working with JSON formatted data.  
02:41
The datasets command-line tool delivers metadata  in two flavors of JSON, JSON and JSON lines.
02:49
So just a quick note about versions. Today  we're talking about version 12 of the datasets  
02:56
command-line tools. These are expected to  remain stable for the foreseeable future,  
03:03
and these are based on version 1 of our API. And  at the moment we are starting development on API  
03:12
version 2, which will power future  releases of the command-line tools,  
03:17
and we are planning a lot of new features  and some breaking changes. This is a great  
03:22
time to give us feedback on where we'll be going  next as we start planning version 2 of the API.
03:31
So for the first example, we are going to  download metadata describing butterfly genomes,  
03:38
then use information in this metadata to help us  identify and download a subset of high quality  
03:44
genomes. We are going to use a built-in filter  to get butterfly genomes at the chromosome  
03:51
assembly level . Then we're going to further  restrict this set of chromosome level butterfly  
03:57
genome assemblies to those above a certain  contig N50 value. Along the way we'll show  
04:05
you how to get genome metadata, how to convert  genome metadata into human readable formats,  
04:11
how you can filter the metadata, and finally,  how to download a set of genome assemblies.
04:16
We are going to start by downloading  genome metadata for all butterflies,  
04:20
so we are going to go ahead and  switch to the terminal window now.  
04:26
Okay. So one way to get genome metadata  is by using the datasets summary command.  
04:34
So I'm going to go through each part of the  command and describe what it means. So datasets  
04:40
summary means that we are going to get metadata,  genome specifies the type of metadata, and taxon  
04:49
tells you what kind of a query you are going to  use to get this genome metadata. So this can be a  
04:55
taxonomic name or NCBI taxonomic I.D. In this case  we are just going to use the name butterflies.  
05:04
The assembly source parameter allows you to  specify whether you want data for either GenBank  
05:10
or RefSeq assemblies. Today we're just going to  look at GenBank assemblies, keep things simple.  
05:17
And we are going to run this command and save  the output to a file, butterfly_summary.json.
05:25
So let's start with the simple question about  this metadata. How many genomes are there?  
05:34
So here's one way to get the genome count. We are  going to use jq to pretty print the data to make  
05:42
it easier to read and then we are going to pipe  this to tail to show the last 10 lines of this  
05:49
pretty printed data. Because the genome count is  always shown at the end of the genome metadata  
05:56
that you get using a summary command, this command  will let you see the number of butterfly genomes  
06:01
described by the metadata. So you can see here it  just shows total count, 562, so that means there's  
06:09
562 genomes. Another way to get the genome count  is to specify the name of this total count field,  
06:16
and using the power of jq, just  pull the value out directly.  
06:22
So here we just run jq and specify the field name,  total count, and the file name. And you'll just  
06:30
get that genome count back directly. So since we  are looking for high quality butterfly genomes,  
06:38
we can use the assembly level filter to only get  chromosome level butterfly genome assemblies.  
06:47
So we are going to run this summary command  again, but this time we'll use two filters.  
06:53
One filter to get genbank assemblies and one  filter to get chromosome level assemblies.  
07:03
Okay. So we've saved that metadata to  a file, butterfly chrome level.json.  
07:09
Now we are going to use jq, specifying the  field name and total count to see how many  
07:17
genomes -- how many butterfly genomes  are at the chromosome assembly level?  
07:23
So you can see that number is a lot smaller than  what we had before. So we've reduced the set of  
07:28
genomes from almost 600 genomes to less than  50. So it's great when you need to filter by  
07:35
some attributes where we already have a built-in  filter, like assembly level or assembly source.
07:41
You also have filters to only get annotated  genomes or get genomes released before or  
07:46
after a certain date. But sometimes you may need  to filter by other attributes of a genome record  
07:52
where we don't have built in filters available.  So I'll show you how to do that using jq.
07:59
People often use contig N50 value  as a measure of genome quality,  
08:05
so to further narrow down our set of  butterfly genomes, we are going to  
08:09
take the list of chromosome level butterfly  genome assemblies we already have, and only  
08:15
keep the genomes where the contig  N50 value is over 1.5 million.
08:21
So I'm going to show you how to make  a simple table of assembly accessions  
08:26
and contig N50 values using jq.
08:34
So basically what this command is saying is take  this metadata describing the chromosome level  
08:41
butterfly assemblies, send it to jq, and then for  each assembly, give me back the assembly accession  
08:53
and the contig N50 value. And then generate  a tab delimited table in CSV format and then  
09:04
I'm going to pipe that to head to just  show the first five rows of that table.  
09:10
So now you have this two column  table. The first column has  
09:14
assembly accessions, the second  column shows you contig N50 values.
09:19
So next we're going to get the  genome accessions for those  
09:22
genomes with a contig N50 value over 1.5 million.  
09:31
so again, we are sending the metadata  for all chromosome level butterfly  
09:36
genome assemblies to jq . Then this jq  command for all assemblies in the set,  
09:43
just give me back the assemblies  with a contig N50 value  
09:48
over 1.5 million, and then for each of those  assemblies, just have me the assembly accession.  
09:58
So this is a whole list of butterfly  genome assemblies that are both  
10:04
chromosome level assemblies with  contig N50 values over 1.5 million.
10:11
And in case you're curious where these genomes  came from, these are all from the Darwin tree  
10:15
of life project, which is a project to  try and sequence 24,000 eukaryotes from  
10:20
Britain and Ireland. Next we're  going to save this list to a file.
10:28
So I'm just going to run that same command again  and save it to a file called accession list.  
10:36
Now that we have this list of assembly accessions,  
10:39
we can download the genome sequences for  these genomes. So this takes a few minutes,  
10:44
so I've already downloaded it in advance, but  I'm going to show you the command that I used.  
10:52
So this is what the download command looks like.  I'm going to go through each part of the command,  
10:58
explain what it means. This says download  genome data given one or more accessions.
11:06
datasets download genome accession
11:06
In this case we are using a text file  with a list of accessions we generated  
11:11
previously to specify which genomes we want. So  then why do you need all these exclude flags?
11:20
So datasets delivers our sequence and annotation  data in a zip archive that we call a data package.  
11:27
By default, these packages will include a  bunch of different types of sequence data,  
11:32
annotation data, and metadata.
11:35
So for example, for an annotated eukaryote genome,  
11:39
we would include the genome  sequence, transcript sequences,  
11:44
annotation in GFF three format and metadata in  one or more files we call data reports. So if  
11:52
you only want to get genome sequence, you can add  various exclude flags to, for example, exclude the  
12:00
protein sequence, exclude the annotation file,  and exclude the RNA or transcript sequences.
12:08
So let's take a quick look at what's in  this downloaded package of genome sequence.  
12:14
I'm going to use the tree command to show you  
12:18
the directory structure and some of the files  that are included in the download package.  
12:23
So as I said, I've already downloaded this  zip archive and extracted it on my computer.  
12:31
We are going to run tree on this directory that  contains the contents of that zip archive and then  
12:38
pipe that to less so we can  scroll through that output.
12:44
So you'll notice that data for each genome  assembly is in a directory that's named with  
12:51
that assembly accession. So in this case we  asked for just genome sequence, so you'll see  
12:58
fasta files in each of these directories  representing each assembled chromosome,  
13:04
sequences from unplaced and unlocalized  sequences are in separate files.
13:10
And in each directory representing a  genome assembly, there's a sequence  
13:14
report file that's basically a list of all  those sequences that comprise the assembly.  
13:19
And if you scroll all the way down to the bottom,  
13:25
you'll see the assembly data report,  which describes all the genome assemblies  
13:32
that are included in this package. So  that's the end of the first example.  
13:40
Okay. So next we're going to talk about SARS-CoV-2  genomes. So this example is focused on SARS-CoV-2  
13:49
genome data. We'll show you how to download  genome data then filter the associated metadata.  
13:56
Specifically we'll show you how to  pull out SARS-CoV-2 genome metadata  
14:00
and sequence data for one SARS-CoV-2  lineage, the lambda variant of interest.
14:06
So first I'll show you how to download  data for all SARS-CoV-2 genomes.  
14:22
So this is the command you'd use  to download all SARS-CoV-2 genomes.
14:28
datasets download virus genome taxon sars-cov-2
14:29
We are now up to over 1.5 million genomes. So this  download can take a little while depending on your  
14:36
Internet connection. For the purpose of today's  demo, we are going to use a smaller package of  
14:42
SARS-CoV-2 genomes that were released during a  three day period last week which contains just  
14:48
about 65,000 genomes. So I've already downloaded  the zip file and extracted the zip archive.
14:56
So we are going to use tree again to take a  look at what's inside what I've downloaded.  
15:06
So you'll see there's just a few files in the  data directory. Compared to that genome data  
15:15
package that we downloaded in the last example. I  just want you to pay attention to two files here.  
15:21
There is the data report that includes metadata  and annotation data for each viral genome,  
15:29
and genomic.fna includes all 65,000  genome sequences. So now I'm going to  
15:38
introduce the other command line tool that  we've developed. It's called dataformat.  
15:42
This is specifically for generating tables based  on the metadata in this data report file here.  
15:53
So we'll go through each part of  this command and what it means.
15:56
dataformat tsv virus-genome --inputfile ...data_report.jsonl 
15:56
--fields isolate-lineage, accession, geo-location,  isolate-collection-date, virus-pangolin
15:56
So dataformat is the name of the tool. TSV means  we want the output to be a tab delimited table,  
16:04
and virus genome specifies the type of  data report we'll be using as input.  
16:11
We'll use the input file parameter to  specify the path to the data report itself,  
16:19
and then we'll use fields to specify all the  columns that we want to include in our table.
16:27
So here we are including isolate-lineage, which  is an identifier for the genome record. Accession  
16:34
is the NCBI nucleotide accession.  Geo-location is the geographic location  
16:41
where the virus was isolated.  Isolate-collection-date is when it was found,  
16:48
and virus pangolin is the pango lineage  name that we've calculated for that genome.  
16:56
Then we are going to use head to show  the first five rows of that table.
17:05
So you can see there's four genomes here, a little  bit hard to read. The last column is kind of going  
17:14
over the next row, but if you've looked at these  pango lineages before, you might recognize this,  
17:20
B.1.617.2 as the so-called delta variant.  Next we're going to pull out metadata for  
17:30
all SARS-CoV-2 genomes, representing  just the lambda variant of interest.
17:36
So one thing you should know about our data  reports is that each line of the data report  
17:42
represents a single record, where in  this case a single SARS-CoV-2 genome.  
17:48
So this allows you to use grep with an  identifier or other string to pull out  
17:53
all the metadata for a genome  containing that identifier.  
18:03
So in this case we want to get all the metadata  for the lambda variant of interest genomes.  
18:09
We don't have the W.H.O. term,  lambda, in our data reports,  
18:13
but we do have the pango lineage names, and for  lambda that's C 37. So we'll grep for "C\.37"  
18:20
to pull out the metadata for those lambda variant  genomes. And then we'll save that to a file,  
18:30
lambda.jsonl. So now let's use -- it's taking a  little bit longer than expected. Here we go. All  
18:39
right. Now we are going to use dataformat  and take a look at the metadata we saved.
18:44
dataformat tsv virus-genome --inputfile lambda.jsonl 
18:45
--fields isolate-lineage,accession,  
18:46
geo-location,isolate-collection-date,virus-pangolin
18:47
So we are going to basically run the same  dataformat command that we just ran, but  
18:54
now we are going to use this new file that we've  generated as the input file. So let's see what  
19:01
that looks like. All right. You can see that --  if you look at this last column you can see now we  
19:08
only have the lambda variant genomes or the C.37  genomes. Next I'm going to show you how we can use  
19:17
dataformat and samtools to pull out the genome  sequences for these lambda variant genomes.
19:24
So first we'll use dataformat to  get the accessions only from our  
19:28
metadata file. Then we'll use that list of  accessions to specify which genome sequences  
19:33
to pull out of our genome sequence file.  So first we'll use dataformat again.
19:40
dataformat tsv virus-genome --inputfile lambda.jsonl 
19:42
--fields accession -elide-header
19:43
This time we'll specify just a single field,  accession, to get a single column table of  
19:49
accessions only. And we are also going to use  the elide-header flag to remove the column name,  
19:57
which could cause problems with samtools. So I'll  just show you what the first five accessions looks  
20:05
like. Okay, now we are going to go ahead and  save that list as a file, lambda_accessions.txt.
20:17
Now we are going to use samtools  to extract the lambda sequences  
20:21
from the original genomic file that  we downloaded as part of the package.
20:26
samtools faidx --region-file lambda_accessions.txt 
20:29
--output lambda_genomes.fna .../genomic.fna
20:31
So here is the samtools command we are using.  
20:36
We're going to specify the file with the accession  list. This is the original genomic sequence file,  
20:47
and then we are going to save the out put as  lambda genomes.FNA. All right. And then let's see  
20:58
what you can do with that file. Okay. We are going  to just drag and drop that sequence file into the  
21:10
Nextclade website. And give it a few seconds. This  is kind of a nice way to visualize these genomes.
21:28
And it's nice to see that they've also  identified as lambda variant genomes,  
21:35
and this kind of gives you a nice visualization of  other mutations that are present in genomes, and  
21:40
some ideas about genome quality.  
21:45
And that concludes the second example.  So I think we can take some questions.
21:51
[Peter Cooper] Hi Eric, thanks. Great. There's a few questions.  
21:58
One, there was a little bit of discussion  about installing this on a cluster, and I  
22:06
wanted Brad to comment on that, because he had a  couple thoughts about that. Brad, are you there?
22:12
[Brad Holmes] I'm happy to speak to that. So I'll be honest,  
22:17
we have a different kind of cluster here, and  in our case we'd be responsible for either,  
22:22
every time we submit a job, ensuring that the  app gets installed for conda or through some  
22:28
direct download link, the quick starts link that's  also in the answer. So that's one option, or you  
22:36
could talk to systems and say we want this sort of  globally installed just like you install any other  
22:40
package, and they could follow the similar steps  to ensure it's available on every machine. So I  
22:46
think the answers kind of sort of very on what  the administrators want to do with the cluster.
22:50
[Peter Cooper] Okay, and you had some comments about Rake, too.
22:56
[Brad Holmes] That's the other thing,  
23:01
these tools are rate limited. It's possible to  get an API key. We are working on setting up API  
23:08
keys through My NCBI, which would  allow a slightly higher rate,  
23:12
but we do not ever intend to support 400  concurrent connections from the same API key,  
23:19
all downloading a package. So it would be bad  practice to say all right, well, I want to start  
23:24
400 jobs and they all go download 1 million  sequences, the same million sequences and they  
23:30
sort of in parallel operate on that package. It  would be better practice to have a single task,  
23:36
to first go get that package of data and  then send that data out to the 400 jobs or  
23:43
however you share data on your cluster, and they  can then operate on the same package. So if we do  
23:50
see ... service is going to degrade, if there  is that many concurrent connections coming in.
23:56
[Peter Cooper] Okay. Thanks, Brad. And another  
24:00
question that was here is about downloading  by accession dot version, and Brad chatted  
24:05
to me that he didn't think it was possible,  so Eric, do you have any comments on that?
24:11
In other words, can you just specify the accession  dot version -- I think this is what the person  
24:14
wanted. Can you give it an accession.version  and use datasets to download something that way?
24:22
[Eric Cox] You can, only with assembly accessions,  
24:26
genome assembly accessions, but you  can't do that yet for SARS-CoV-2 genomes.
24:32
[Brad Holmes] 
24:33
You can also download gene data by RefSeq  accession.versions. But a question came  
24:41
in while we were looking at viruses, so in  terms of the virus genomic RefSeq accession,  
24:46
there is not a way to access that data.  If that's useful, we'd like to know.
24:50
[Peter Cooper] 
24:52
Okay, and there was another -- here's a couple  more that have come in. But one of them had to  
24:58
do with how quickly sequences that are submitted,  I think that's what the question is really about,  
25:03
how quickly they are turned around. So we are  talking about SARS-CoV-2. This may be a question  
25:08
for a group at NCBI. If it is, I'll get the answer  from them, too. But how quickly does a sequence  
25:15
get from being submitted to being available  on datasets for the SARS-CoV-2 sequences?
25:21
[Nuala O'leary] I can answer that.  
25:23
So when sequences gets submitted to genbank,  they're going to get picked up by NCBI Virus,  
25:29
if you're aware of that resource,  and we pull them from NCBI Virus.  
25:35
Brad may be able to comment on exactly how long it  takes, but we do get our virus data set every day,  
25:43
so we should pick up anything that's new in NCBI  virus. And I believe that also gets updated every  
25:48
day, or a couple times a day. And Brad, if  you have any more information on that --
25:53
[Brad Holmes] I don't. It's so many factors involved,  
25:56
how quickly the labs turn over the data, how  quickly it goes through our submission system,  
26:00
how many other submissions we received  the same day. We are working to get  
26:04
out as fast as possible, 24 hours is certainly  possible. But there are no guarantees on that.
26:10
[Peter Cooper] And somebody asked a  
26:13
question that's probably worth addressing.  You mentioned there are 1.5 million  
26:18
genomes for SARS-CoV-2 and the question  is, are they all unique or are they  
26:23
duplicate sequences isolated from different  individuals? And Brad sort of answered that.
26:28
[Nuala O'leary] 
26:30
They are unique submissions, but  the sequences can be identical.
26:33
[Peter Cooper] Okay, and there was another  
26:39
question that -- just a general one about how to  find out about webinars and workshops like this.  
26:46
And Eric, did you have that slide  at the end that had the sort of  
26:50
links to the social media and things  like that? You kept that in your slide?
26:53
[Eric Cox] Yes, this one?
26:56
[Peter Cooper] ust to point out to you that  
27:00
we have social media presence and we  announce all these things on social media,  
27:06
on Twitter, on Facebook, and on LinkedIn.  We also have a blog, NCBI Insights, so if  
27:13
you go to that blog and and you have a WordPress  account you can subscribe to that blog as well.  
27:20
There is also an old-fashioned email list  that somebody asking about, and that's  
27:27
the NCBI announce email list, so if that's  something you want, I could put that URL into  
27:34
the chat to everybody and that's how you get to  it if you want to subscribe to that email list.
27:39
ncbi-announce@ncbi.nlm.nih.gov
27:40
Okay, and I think we covered all the questions  that you have in the chat right now. If there's  
27:46
any that we didn't get to answer, I'll  make sure that they're in the document.  
27:52
Thanks, everybody, for coming. Thanks,  Brad, Nuala, and Eric for being here.
27:58
[Eric Cox] Thank you.
27:59
[Peter Cooper] Okay, thanks a lot.