Wiki Documentation

Introduction to the SeqRecord class.

This page describes the SeqRecord object used in Biopython to hold a sequence (as a Seq object) with identifiers (ID and name), description and optionally annotation and sub-features.

Most of the sequence file format parsers in BioPython can return SeqRecord objects (and may offer a format specific record object too, see for example Bio.SwissProt). The SeqIO system will only return SeqRecord objects.

In addition to the SeqRecord object’s API documentation, there is a whole chapter in the Tutorial (PDF), and the SeqIO page is also very relevant.

Extracting information from a SeqRecord

Lets look in detail at the well annotated SeqRecord objects Biopython creates from a GenBank file, such as ls_orchid.gbk, which we’ll load using the SeqIO module. This file contains 94 records:

from Bio import SeqIO
for index, record in enumerate(SeqIO.parse("ls_orchid.gbk", "genbank")):
    print("index %i, ID = %s, length %i, with %i features"
          % (index, record.id, len(record.seq), len(record.features)))

And this is some of the output. Remember Python likes to count from zero, so the 94 records in this file have been labelled 0 to 93:

index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features

Lets look in a little more detail at the final record:

>>> print(record)

That should give you a hint of the sort of information held in this object:

ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[`<Bio.SeqFeature.Reference ...>`, `<Bio.SeqFeature.Reference ...>`]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())

Lets look a little more closely… and use Python’s dir() function to find out more about the SeqRecord object and what it does:

>>> dir(record)
[..., 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'name', 'seq']

If you didn’t already know, the dir() function returns a list of all the methods and properties of an object (as strings). Those starting with underscores in their name are “special” and we’ll be ignoring them in this discussion. We’ll start with the .seq property:

>>> record.seq
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
>>> type(record.seq)
<class 'Bio.Seq.Seq'>

This is a Seq object, another important object type in Biopython, and worth of its own page on the wiki documentation.

The following three properties are all simple strings:

>>> print(record.id)
Z78439.1
>>> print(record.name)
Z78439
>>> print(record.description)
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.

Have a look at the raw GenBank file to see where these came from.

Next, we’ll check the .dxrefs property, which holds any database cross references:

>>> print(record.dbxrefs)
[]
>>> type(record.dbxrefs)
<type 'list'>

An empty list? Disappointing… if we’d used a more recent GenBank file the genome sequencing project reference would show up here.

How about the .annotations property? This is a Python dictionary…

>>> print(record.annotations)
{'sequence_version: 1, 'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
>>> tpye(record.annotations)
<type 'dict'>
>>> print(record.annotations["source"])
Paphiopedilum barbatum

In this case, most of the values in the dictionary are simple strings, but this isn’t always the case - have a look at the references entry for this example - its a list of Reference objects:

>>> type(record.annotations["references"])
<type 'list'>
>>> print(len(record.annotations["references"]))
2
>>> for ref in record.annotations["references"]:
        print(ref.authors)

Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
Cox,A.V.

Next is .features which is another list property, and it contains SeqFeature objects:

>>> type(record.features)
<type 'list'>
>>> print(len(record.features))
5

SeqFeature objects are complicated enough to warrant their own wiki page… for now please refer to the Tutorial.

If you are using Biopython 1.48 or later, there will be a .format() method. This lets you convert the SeqRecord into a string using one of the output formats supported by Bio.SeqIO, for example:

>>> print(record.format("fasta"))

This should give:

>Z78439.1 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC
ACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTT
TCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACA
TAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAAC
TCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATG
GTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCA
TCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTG
TCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGG
GGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTA
TGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC

If you are using Biopython 1.50 or later, there will also be a .letter_annotations property. Again this is a dictionary but for per-letter-annotation such as sequence quality scores or secondary structure predictions. This kind of information isn’t found in GenBank files, so in this case the dictionary is empty:

>>> print(record.letter_annotations)
{}

Have a look at FASTQ or QUAL files to see how quality scores are represented. Stockholm (PFAM) alignment files also often include per-letter-annotation.

Creating a SeqRecord object

Most of the time you’ll create SeqRecord objects by parsing a sequence file with Bio.SeqIO. However, it is useful to know how to create a SeqRecord directly. For example,

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                       IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print(record)

This would give the following output:

ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

You could then pass this new record to Bio.SeqIO.write(...) to save it to disk.