An Introduction to Molecular Evolutionary Analysis with NCBI Datasets and Python

This workshop was offered virtually on July 21, 2022.

Workshop Duration: 3 hours

Content Difficulty: Intermediate

Target Audience:

This workshop is designed for biologists interested in evolutionary biology and nucleotide sequence analysis who have some scripting experience.

Workshop Description:

As species diverge from a common ancestor, they accumulate differences in their DNA sequences. Differences within a protein-coding region fall into two classes: non-synonymous substitutions change the amino acid sequence of the protein, while synonymous substitutions do not. Synonymous substitutions are largely invisible to natural selection and tend to accumulate at a constant rate. Non-synonymous substitutions, by contrast, accumulate faster when their effects are beneficial and are suppressed when their effects are deleterious. By comparing the rates of non-synonymous and synonymous substitutions, we can infer whether natural selection has primarily acted to conserve the protein sequence or to adapt it to a new environment or function.
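To make the distinction between the two classes concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the workshop's own code: the function name classify_codon_differences and the example sequences are hypothetical, the two sequences are assumed to be already aligned in frame with no gaps, and the standard genetic code is assumed. It only counts differing codons; real rate estimates also normalize by the number of synonymous and non-synonymous sites and correct for multiple substitutions at the same position.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_differences(seq_a, seq_b):
    """Count codons that differ synonymously and non-synonymously.

    Hypothetical helper: assumes both sequences are in-frame, aligned,
    and of equal length. Codons containing gaps or ambiguous bases are
    skipped rather than classified.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    synonymous = 0
    nonsynonymous = 0
    for start in range(0, len(seq_a), 3):
        codon_a = seq_a[start:start + 3].upper()
        codon_b = seq_b[start:start + 3].upper()
        if codon_a == codon_b:
            continue  # identical codons carry no difference
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip gaps and ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1  # same amino acid, different codon
        else:
            nonsynonymous += 1  # amino acid changed
    return synonymous, nonsynonymous


# Made-up example: GCT -> GCC keeps alanine, AAA -> AGA turns lysine into arginine.
example_a = "ATGGCTAAAGGT"
example_b = "ATGGCCAGAGGT"
print(classify_codon_differences(example_a, example_b))  # prints (1, 1)
```

Counting whole codons keeps the sketch short. A codon that differs at more than one position is counted once here, whereas established methods consider the possible mutational paths between the two codons.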

In this workshop, we will compare the protein-coding sequences of two species to identify which proteins show signs of adaptation. Working in a Jupyter notebook with bash and Python, we will use the NCBI Datasets command line interface (CLI) to download sequence data, then analyze it with a few popular Python packages. The workshop assumes basic familiarity with a scripting language such as Python or R at a level equivalent to a semester course or programming bootcamp.

In this workshop, you will learn how to:

Search for and download protein ortholog sequences with NCBI Datasets CLI
Parse the downloaded files with Biopython
Identify synonymous and non-synonymous substitutions and calculate substitution rates
Plot the results with Matplotlib
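
The rate comparison in the third item is commonly summarized as the ratio of the non-synonymous rate (dN) to the synonymous rate (dS). The sketch below shows one way such a ratio might be interpreted. It is not taken from the workshop materials: the function name selection_signal, the tolerance value, and the input numbers are all hypothetical and chosen only for illustration.

```python
def selection_signal(dn, ds, tolerance=0.1):
    """Return (dN/dS, rough interpretation) for one gene.

    Hypothetical helper. The tolerance around 1.0 is an arbitrary
    illustrative choice, not a statistical test.
    """
    if ds <= 0:
        # The ratio is undefined without synonymous substitutions.
        return None, "undefined"
    omega = dn / ds
    if omega > 1 + tolerance:
        label = "possible positive selection"
    elif omega < 1 - tolerance:
        label = "purifying selection"
    else:
        label = "approximately neutral"
    return omega, label


# Made-up rates for illustration only.
print(selection_signal(0.02, 0.10))
print(selection_signal(0.30, 0.10))
```

A ratio well below 1 suggests that selection has conserved the protein sequence, while a ratio above 1 suggests adaptation. In practice a formal statistical test is needed before drawing either conclusion.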

Data Access Technology: Jupyter Notebook, Python, Command-line

NCBI Resources: NCBI Datasets

Last Reviewed: August 5, 2022