An Introduction to Molecular Evolutionary Analysis with NCBI Datasets and Python

This workshop was offered virtually on July 21, 2022.

Workshop Duration: 3 hours

Content Difficulty: Intermediate

Target Audience:

This workshop is designed for biologists with some scripting experience who are interested in evolutionary biology and nucleotide sequence analysis.

Workshop Description:

As they diverge from a common ancestor, species accumulate differences in their DNA sequences. Differences within a protein-coding region are classified into two types: non-synonymous substitutions change the amino acid sequence of the protein, while synonymous substitutions do not. Synonymous substitutions are largely invisible to natural selection and tend to accumulate at a roughly constant rate. In contrast, non-synonymous substitutions whose effects are beneficial accumulate at a faster rate, while those that are deleterious are suppressed. By comparing the rates of non-synonymous and synonymous substitutions, we can infer whether natural selection has primarily acted to conserve the protein sequence or to adapt it to a new environment or function.
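
To make the distinction concrete, here is a minimal Python sketch that compares two aligned coding sequences codon by codon and labels each differing codon as synonymous or non-synonymous. It is an illustration only, not the workshop's own code: the function name `classify_codon_differences` and the example sequences are made up, the standard genetic code is assumed, and the sequences are assumed to be aligned in frame. Codons containing gaps or ambiguous bases are skipped.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_differences(seq_a, seq_b):
    """Count codons that differ synonymously and non-synonymously.

    Hypothetical helper for illustration. Both sequences must be aligned,
    in frame, and of equal length. Returns (synonymous, non_synonymous).
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    synonymous = 0
    non_synonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # identical codons carry no substitution
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip gaps and ambiguous bases
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid: CTT (Leu) vs CTC (Leu)
        else:
            non_synonymous += 1  # different amino acid: AAA (Lys) vs AGA (Arg)
    return synonymous, non_synonymous


# Made-up example: ATG/ATG identical, CTT/CTC synonymous, AAA/AGA non-synonymous.
print(classify_codon_differences("ATGCTTAAA", "ATGCTCAGA"))  # (1, 1)
```

This counts differing codons rather than individual nucleotide substitutions, and raw counts are not rates; estimating rates also requires counting how many sites could change synonymously or non-synonymously.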

In this workshop we will compare the protein-coding sequences of two species to estimate which proteins show signs of adaptation. Working in a Jupyter notebook with bash and Python, we will use the NCBI Datasets command line interface (CLI) to download sequence data, then perform analysis with a few popular Python packages. The workshop assumes basic familiarity with a scripting language such as Python or R at a level equivalent to a semester course or programming bootcamp.

In this workshop, you will learn how to:
  • Search for and download protein ortholog sequences with NCBI Datasets CLI
  • Parse the downloaded files with Biopython
  • Identify synonymous and non-synonymous substitutions and calculate substitution rates
  • Plot the results with Matplotlib
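
As a rough illustration of the rate calculation step, the sketch below estimates the proportion of synonymous differences per synonymous site (pS) and of non-synonymous differences per non-synonymous site (pN) for two aligned coding sequences. This description does not say which method the workshop uses, so treat everything here as an assumption: the site counting loosely follows the Nei-Gojobori idea, the function names `synonymous_sites` and `proportion_differences` are hypothetical, codons that differ at more than one position are ignored rather than averaged over mutational pathways, and no correction for multiple substitutions at the same site is applied.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def synonymous_sites(codon):
    """Expected number of synonymous sites in one codon (between 0 and 3).

    Each of the 9 possible single-base changes contributes 1/3 of a site
    if it leaves the encoded amino acid unchanged.
    """
    unchanged = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                unchanged += 1
    return unchanged / 3


def proportion_differences(seq_a, seq_b):
    """Return (pS, pN) for two aligned, in-frame coding sequences.

    Simplification: only codons that are identical or differ at exactly
    one position are used. A value is None when it cannot be computed.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")

    syn_sites = 0.0
    codons_used = 0
    syn_diffs = 0
    nonsyn_diffs = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            continue  # skip gaps and ambiguous bases
        n_changes = sum(a != b for a, b in zip(codon_a, codon_b))
        if n_changes > 1:
            continue  # multi-change codons are outside this sketch
        codons_used += 1
        # Average the site counts of the two codons being compared.
        syn_sites += (synonymous_sites(codon_a) + synonymous_sites(codon_b)) / 2
        if n_changes == 1:
            if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
                syn_diffs += 1
            else:
                nonsyn_diffs += 1

    nonsyn_sites = 3 * codons_used - syn_sites
    p_s = syn_diffs / syn_sites if syn_sites else None
    p_n = nonsyn_diffs / nonsyn_sites if nonsyn_sites else None
    return p_s, p_n


# Made-up example sequences, three codons each.
p_s, p_n = proportion_differences("ATGCTTAAA", "ATGCTCAGA")
print(f"pS = {p_s:.3f}, pN = {p_n:.3f}, pN/pS = {p_n / p_s:.3f}")
```

Under these assumptions, a pN/pS ratio well below 1 is consistent with selection conserving the protein sequence, while a ratio above 1 hints at adaptation. Sequences as short as the example are far too small to support any real inference.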

Data Access Technology: Jupyter Notebook, Python, Command-line

NCBI Resources: NCBI Datasets


Last Reviewed: August 5, 2022