00:02 [Peter Cooper] Okay, so I'm going to go ahead and turn this 00:04 over to Eric and he's going to take you through datasets and online tools. Take it away, Eric. 00:09 [Eric Cox] Thank you, 00:10 Peter. Thanks, everybody, for coming today. Welcome to the webinar on using 00:16 datasets commandline tools to download and use genome metadata. 00:20 So here is an outline of the webinar today. First I'm going to introduce the datasets project 00:27 and our command-line tools . Then I'm going to go over two examples of using our command-line tools. 00:35 So first I'm going to show you one way to get high quality butterfly genomes, 00:41 and then I'm going to show you how to find SARS-CoV-2 lambda variant genomes. 00:47 So the butterfly genome example will hopefully give you some general ideas 00:51 about how you can get genome metadata and the sequence data for really any organism, 00:57 while the SARS-CoV-2 example is a little more specifically focused on SARS-CoV-2 data. 01:06 So datasets is a relatively new effort at NCBI 01:10 that's focused on making sequence data easy to find and download. 01:15 We have web pages including our home page on here, 01:20 command-line tools, that I'm going to talk about today, and documented public rest API. 01:27 A big emphasis of the project is providing easier access to genome data, 01:32 including genome sequence, genome annotation, and transcript and protein sequences. 01:41 So today's presentation is just focused on accessing genome data using our command-line 01:47 tools. So as I said, we are looking at two examples of using our command-line tools . In 01:54 both examples we'll use the datasets tool. This is for downloading biological sequence data, 02:01 and in the second example we'll also talk about dataformat, which is used to generate 02:06 human readable metadata tables. We don't have time today to go over installing the tools, but you can 02:15 download them from the datasets website, and we have recently made them available through conda. 02:25 We are also going to use two other third party tools, mostly jq and we'll also touch a little bit 02:33 on samtools. jq is a great tool for working with JSON formatted data. 02:41 The datasets command-line tool delivers metadata in two flavors of JSON, JSON and JSON lines. 02:49 So just a quick note about versions. Today we're talking about version 12 of the datasets 02:56 command-line tools. These are expected to remain stable for the foreseeable future, 03:03 and these are based on version 1 of our API. And at the moment we are starting development on API 03:12 version 2, which will power future releases of the command-line tools, 03:17 and we are planning a lot of new features and some breaking changes. This is a great 03:22 time to give us feedback on where we'll be going next as we start planning version 2 of the API. 03:31 So for the first example, we are going to download metadata describing butterfly genomes, 03:38 then use information in this metadata to help us identify and download a subset of high quality 03:44 genomes. We are going to use a built-in filter to get butterfly genomes at the chromosome 03:51 assembly level . Then we're going to further restrict this set of chromosome level butterfly 03:57 genome assemblies to those above a certain contig N50 value. Along the way we'll show 04:05 you how to get genome metadata, how to convert genome metadata into human readable formats, 04:11 how you can filter the metadata, and finally, how to download a set of genome assemblies. 04:16 We are going to start by downloading genome metadata for all butterflies, 04:20 so we are going to go ahead and switch to the terminal window now. 04:26 Okay. So one way to get genome metadata is by using the datasets summary command. 04:34 So I'm going to go through each part of the command and describe what it means. So datasets 04:40 summary means that we are going to get metadata, genome specifies the type of metadata, and taxon 04:49 tells you what kind of a query you are going to use to get this genome metadata. So this can be a 04:55 taxonomic name or NCBI taxonomic I.D. In this case we are just going to use the name butterflies. 05:04 The assembly source parameter allows you to specify whether you want data for either GenBank 05:10 or RefSeq assemblies. Today we're just going to look at GenBank assemblies, keep things simple. 05:17 And we are going to run this command and save the output to a file, butterfly_summary.json. 05:25 So let's start with the simple question about this metadata. How many genomes are there? 05:34 So here's one way to get the genome count. We are going to use jq to pretty print the data to make 05:42 it easier to read and then we are going to pipe this to tail to show the last 10 lines of this 05:49 pretty printed data. Because the genome count is always shown at the end of the genome metadata 05:56 that you get using a summary command, this command will let you see the number of butterfly genomes 06:01 described by the metadata. So you can see here it just shows total count, 562, so that means there's 06:09 562 genomes. Another way to get the genome count is to specify the name of this total count field, 06:16 and using the power of jq, just pull the value out directly. 06:22 So here we just run jq and specify the field name, total count, and the file name. And you'll just 06:30 get that genome count back directly. So since we are looking for high quality butterfly genomes, 06:38 we can use the assembly level filter to only get chromosome level butterfly genome assemblies. 06:47 So we are going to run this summary command again, but this time we'll use two filters. 06:53 One filter to get genbank assemblies and one filter to get chromosome level assemblies. 07:03 Okay. So we've saved that metadata to a file, butterfly chrome level.json. 07:09 Now we are going to use jq, specifying the field name and total count to see how many 07:17 genomes -- how many butterfly genomes are at the chromosome assembly level? 07:23 So you can see that number is a lot smaller than what we had before. So we've reduced the set of 07:28 genomes from almost 600 genomes to less than 50. So it's great when you need to filter by 07:35 some attributes where we already have a built-in filter, like assembly level or assembly source. 07:41 You also have filters to only get annotated genomes or get genomes released before or 07:46 after a certain date. But sometimes you may need to filter by other attributes of a genome record 07:52 where we don't have built in filters available. So I'll show you how to do that using jq. 07:59 People often use contig N50 value as a measure of genome quality, 08:05 so to further narrow down our set of butterfly genomes, we are going to 08:09 take the list of chromosome level butterfly genome assemblies we already have, and only 08:15 keep the genomes where the contig N50 value is over 1.5 million. 08:21 So I'm going to show you how to make a simple table of assembly accessions 08:26 and contig N50 values using jq. 08:34 So basically what this command is saying is take this metadata describing the chromosome level 08:41 butterfly assemblies, send it to jq, and then for each assembly, give me back the assembly accession 08:53 and the contig N50 value. And then generate a tab delimited table in CSV format and then 09:04 I'm going to pipe that to head to just show the first five rows of that table. 09:10 So now you have this two column table. The first column has 09:14 assembly accessions, the second column shows you contig N50 values. 09:19 So next we're going to get the genome accessions for those 09:22 genomes with a contig N50 value over 1.5 million. 09:31 so again, we are sending the metadata for all chromosome level butterfly 09:36 genome assemblies to jq . Then this jq command for all assemblies in the set, 09:43 just give me back the assemblies with a contig N50 value 09:48 over 1.5 million, and then for each of those assemblies, just have me the assembly accession. 09:58 So this is a whole list of butterfly genome assemblies that are both 10:04 chromosome level assemblies with contig N50 values over 1.5 million. 10:11 And in case you're curious where these genomes came from, these are all from the Darwin tree 10:15 of life project, which is a project to try and sequence 24,000 eukaryotes from 10:20 Britain and Ireland. Next we're going to save this list to a file. 10:28 So I'm just going to run that same command again and save it to a file called accession list. 10:36 Now that we have this list of assembly accessions, 10:39 we can download the genome sequences for these genomes. So this takes a few minutes, 10:44 so I've already downloaded it in advance, but I'm going to show you the command that I used. 10:52 So this is what the download command looks like. I'm going to go through each part of the command, 10:58 explain what it means. This says download genome data given one or more accessions. 11:06 datasets download genome accession 11:06 In this case we are using a text file with a list of accessions we generated 11:11 previously to specify which genomes we want. So then why do you need all these exclude flags? 11:20 So datasets delivers our sequence and annotation data in a zip archive that we call a data package. 11:27 By default, these packages will include a bunch of different types of sequence data, 11:32 annotation data, and metadata. 11:35 So for example, for an annotated eukaryote genome, 11:39 we would include the genome sequence, transcript sequences, 11:44 annotation in GFF three format and metadata in one or more files we call data reports. So if 11:52 you only want to get genome sequence, you can add various exclude flags to, for example, exclude the 12:00 protein sequence, exclude the annotation file, and exclude the RNA or transcript sequences. 12:08 So let's take a quick look at what's in this downloaded package of genome sequence. 12:14 I'm going to use the tree command to show you 12:18 the directory structure and some of the files that are included in the download package. 12:23 So as I said, I've already downloaded this zip archive and extracted it on my computer. 12:31 We are going to run tree on this directory that contains the contents of that zip archive and then 12:38 pipe that to less so we can scroll through that output. 12:44 So you'll notice that data for each genome assembly is in a directory that's named with 12:51 that assembly accession. So in this case we asked for just genome sequence, so you'll see 12:58 fasta files in each of these directories representing each assembled chromosome, 13:04 sequences from unplaced and unlocalized sequences are in separate files. 13:10 And in each directory representing a genome assembly, there's a sequence 13:14 report file that's basically a list of all those sequences that comprise the assembly. 13:19 And if you scroll all the way down to the bottom, 13:25 you'll see the assembly data report, which describes all the genome assemblies 13:32 that are included in this package. So that's the end of the first example. 13:40 Okay. So next we're going to talk about SARS-CoV-2 genomes. So this example is focused on SARS-CoV-2 13:49 genome data. We'll show you how to download genome data then filter the associated metadata. 13:56 Specifically we'll show you how to pull out SARS-CoV-2 genome metadata 14:00 and sequence data for one SARS-CoV-2 lineage, the lambda variant of interest. 14:06 So first I'll show you how to download data for all SARS-CoV-2 genomes. 14:22 So this is the command you'd use to download all SARS-CoV-2 genomes. 14:28 datasets download virus genome taxon sars-cov-2 14:29 We are now up to over 1.5 million genomes. So this download can take a little while depending on your 14:36 Internet connection. For the purpose of today's demo, we are going to use a smaller package of 14:42 SARS-CoV-2 genomes that were released during a three day period last week which contains just 14:48 about 65,000 genomes. So I've already downloaded the zip file and extracted the zip archive. 14:56 So we are going to use tree again to take a look at what's inside what I've downloaded. 15:06 So you'll see there's just a few files in the data directory. Compared to that genome data 15:15 package that we downloaded in the last example. I just want you to pay attention to two files here. 15:21 There is the data report that includes metadata and annotation data for each viral genome, 15:29 and genomic.fna includes all 65,000 genome sequences. So now I'm going to 15:38 introduce the other command line tool that we've developed. It's called dataformat. 15:42 This is specifically for generating tables based on the metadata in this data report file here. 15:53 So we'll go through each part of this command and what it means. 15:56 dataformat tsv virus-genome --inputfile ...data_report.jsonl 15:56 --fields isolate-lineage, accession, geo-location, isolate-collection-date, virus-pangolin 15:56 So dataformat is the name of the tool. TSV means we want the output to be a tab delimited table, 16:04 and virus genome specifies the type of data report we'll be using as input. 16:11 We'll use the input file parameter to specify the path to the data report itself, 16:19 and then we'll use fields to specify all the columns that we want to include in our table. 16:27 So here we are including isolate-lineage, which is an identifier for the genome record. Accession 16:34 is the NCBI nucleotide accession. Geo-location is the geographic location 16:41 where the virus was isolated. Isolate-collection-date is when it was found, 16:48 and virus pangolin is the pango lineage name that we've calculated for that genome. 16:56 Then we are going to use head to show the first five rows of that table. 17:05 So you can see there's four genomes here, a little bit hard to read. The last column is kind of going 17:14 over the next row, but if you've looked at these pango lineages before, you might recognize this, 17:20 B.1.617.2 as the so-called delta variant. Next we're going to pull out metadata for 17:30 all SARS-CoV-2 genomes, representing just the lambda variant of interest. 17:36 So one thing you should know about our data reports is that each line of the data report 17:42 represents a single record, where in this case a single SARS-CoV-2 genome. 17:48 So this allows you to use grep with an identifier or other string to pull out 17:53 all the metadata for a genome containing that identifier. 18:03 So in this case we want to get all the metadata for the lambda variant of interest genomes. 18:09 We don't have the W.H.O. term, lambda, in our data reports, 18:13 but we do have the pango lineage names, and for lambda that's C 37. So we'll grep for "C\.37" 18:20 to pull out the metadata for those lambda variant genomes. And then we'll save that to a file, 18:30 lambda.jsonl. So now let's use -- it's taking a little bit longer than expected. Here we go. All 18:39 right. Now we are going to use dataformat and take a look at the metadata we saved. 18:44 dataformat tsv virus-genome --inputfile lambda.jsonl 18:45 --fields isolate-lineage,accession, 18:46 geo-location,isolate-collection-date,virus-pangolin 18:47 So we are going to basically run the same dataformat command that we just ran, but 18:54 now we are going to use this new file that we've generated as the input file. So let's see what 19:01 that looks like. All right. You can see that -- if you look at this last column you can see now we 19:08 only have the lambda variant genomes or the C.37 genomes. Next I'm going to show you how we can use 19:17 dataformat and samtools to pull out the genome sequences for these lambda variant genomes. 19:24 So first we'll use dataformat to get the accessions only from our 19:28 metadata file. Then we'll use that list of accessions to specify which genome sequences 19:33 to pull out of our genome sequence file. So first we'll use dataformat again. 19:40 dataformat tsv virus-genome --inputfile lambda.jsonl 19:42 --fields accession -elide-header 19:43 This time we'll specify just a single field, accession, to get a single column table of 19:49 accessions only. And we are also going to use the elide-header flag to remove the column name, 19:57 which could cause problems with samtools. So I'll just show you what the first five accessions looks 20:05 like. Okay, now we are going to go ahead and save that list as a file, lambda_accessions.txt. 20:17 Now we are going to use samtools to extract the lambda sequences 20:21 from the original genomic file that we downloaded as part of the package. 20:26 samtools faidx --region-file lambda_accessions.txt 20:29 --output lambda_genomes.fna .../genomic.fna 20:31 So here is the samtools command we are using. 20:36 We're going to specify the file with the accession list. This is the original genomic sequence file, 20:47 and then we are going to save the out put as lambda genomes.FNA. All right. And then let's see 20:58 what you can do with that file. Okay. We are going to just drag and drop that sequence file into the 21:10 Nextclade website. And give it a few seconds. This is kind of a nice way to visualize these genomes. 21:28 And it's nice to see that they've also identified as lambda variant genomes, 21:35 and this kind of gives you a nice visualization of other mutations that are present in genomes, and 21:40 some ideas about genome quality. 21:45 And that concludes the second example. So I think we can take some questions. 21:51 [Peter Cooper] Hi Eric, thanks. Great. There's a few questions. 21:58 One, there was a little bit of discussion about installing this on a cluster, and I 22:06 wanted Brad to comment on that, because he had a couple thoughts about that. Brad, are you there? 22:12 [Brad Holmes] I'm happy to speak to that. So I'll be honest, 22:17 we have a different kind of cluster here, and in our case we'd be responsible for either, 22:22 every time we submit a job, ensuring that the app gets installed for conda or through some 22:28 direct download link, the quick starts link that's also in the answer. So that's one option, or you 22:36 could talk to systems and say we want this sort of globally installed just like you install any other 22:40 package, and they could follow the similar steps to ensure it's available on every machine. So I 22:46 think the answers kind of sort of very on what the administrators want to do with the cluster. 22:50 [Peter Cooper] Okay, and you had some comments about Rake, too. 22:56 [Brad Holmes] That's the other thing, 23:01 these tools are rate limited. It's possible to get an API key. We are working on setting up API 23:08 keys through My NCBI, which would allow a slightly higher rate, 23:12 but we do not ever intend to support 400 concurrent connections from the same API key, 23:19 all downloading a package. So it would be bad practice to say all right, well, I want to start 23:24 400 jobs and they all go download 1 million sequences, the same million sequences and they 23:30 sort of in parallel operate on that package. It would be better practice to have a single task, 23:36 to first go get that package of data and then send that data out to the 400 jobs or 23:43 however you share data on your cluster, and they can then operate on the same package. So if we do 23:50 see ... service is going to degrade, if there is that many concurrent connections coming in. 23:56 [Peter Cooper] Okay. Thanks, Brad. And another 24:00 question that was here is about downloading by accession dot version, and Brad chatted 24:05 to me that he didn't think it was possible, so Eric, do you have any comments on that? 24:11 In other words, can you just specify the accession dot version -- I think this is what the person 24:14 wanted. Can you give it an accession.version and use datasets to download something that way? 24:22 [Eric Cox] You can, only with assembly accessions, 24:26 genome assembly accessions, but you can't do that yet for SARS-CoV-2 genomes. 24:32 [Brad Holmes] 24:33 You can also download gene data by RefSeq accession.versions. But a question came 24:41 in while we were looking at viruses, so in terms of the virus genomic RefSeq accession, 24:46 there is not a way to access that data. If that's useful, we'd like to know. 24:50 [Peter Cooper] 24:52 Okay, and there was another -- here's a couple more that have come in. But one of them had to 24:58 do with how quickly sequences that are submitted, I think that's what the question is really about, 25:03 how quickly they are turned around. So we are talking about SARS-CoV-2. This may be a question 25:08 for a group at NCBI. If it is, I'll get the answer from them, too. But how quickly does a sequence 25:15 get from being submitted to being available on datasets for the SARS-CoV-2 sequences? 25:21 [Nuala O'leary] I can answer that. 25:23 So when sequences gets submitted to genbank, they're going to get picked up by NCBI Virus, 25:29 if you're aware of that resource, and we pull them from NCBI Virus. 25:35 Brad may be able to comment on exactly how long it takes, but we do get our virus data set every day, 25:43 so we should pick up anything that's new in NCBI virus. And I believe that also gets updated every 25:48 day, or a couple times a day. And Brad, if you have any more information on that -- 25:53 [Brad Holmes] I don't. It's so many factors involved, 25:56 how quickly the labs turn over the data, how quickly it goes through our submission system, 26:00 how many other submissions we received the same day. We are working to get 26:04 out as fast as possible, 24 hours is certainly possible. But there are no guarantees on that. 26:10 [Peter Cooper] And somebody asked a 26:13 question that's probably worth addressing. You mentioned there are 1.5 million 26:18 genomes for SARS-CoV-2 and the question is, are they all unique or are they 26:23 duplicate sequences isolated from different individuals? And Brad sort of answered that. 26:28 [Nuala O'leary] 26:30 They are unique submissions, but the sequences can be identical. 26:33 [Peter Cooper] Okay, and there was another 26:39 question that -- just a general one about how to find out about webinars and workshops like this. 26:46 And Eric, did you have that slide at the end that had the sort of 26:50 links to the social media and things like that? You kept that in your slide? 26:53 [Eric Cox] Yes, this one? 26:56 [Peter Cooper] ust to point out to you that 27:00 we have social media presence and we announce all these things on social media, 27:06 on Twitter, on Facebook, and on LinkedIn. We also have a blog, NCBI Insights, so if 27:13 you go to that blog and and you have a WordPress account you can subscribe to that blog as well. 27:20 There is also an old-fashioned email list that somebody asking about, and that's 27:27 the NCBI announce email list, so if that's something you want, I could put that URL into 27:34 the chat to everybody and that's how you get to it if you want to subscribe to that email list. 27:39 ncbi-announce@ncbi.nlm.nih.gov 27:40 Okay, and I think we covered all the questions that you have in the chat right now. If there's 27:46 any that we didn't get to answer, I'll make sure that they're in the document. 27:52 Thanks, everybody, for coming. Thanks, Brad, Nuala, and Eric for being here. 27:58 [Eric Cox] Thank you. 27:59 [Peter Cooper] Okay, thanks a lot.