This page describes the SeqRecord
object used in Biopython to hold a
sequence (as a Seq
object) with identifiers (ID and
name), description and optionally annotation and sub-features.
Most of the sequence file format parsers in BioPython can return
SeqRecord
objects (and may offer a format specific record object
too, see for example Bio.SwissProt
). The SeqIO
system will only return SeqRecord
objects.
In addition to the SeqRecord
object’s API
documentation,
there is a whole chapter in the
Tutorial
(PDF), and the
SeqIO
page is also very relevant.
Lets look in detail at the well annotated SeqRecord
objects
Biopython creates from a GenBank file, such as
ls_orchid.gbk
,
which we’ll load using the SeqIO
module. This file
contains 94 records:
from Bio import SeqIO
for index, record in enumerate(SeqIO.parse("ls_orchid.gbk", "genbank")):
print("index %i, ID = %s, length %i, with %i features"
% (index, record.id, len(record.seq), len(record.features)))
And this is some of the output. Remember Python likes to count from zero, so the 94 records in this file have been labelled 0 to 93:
index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features
Lets look in a little more detail at the final record:
>>> print(record)
That should give you a hint of the sort of information held in this object:
ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[`<Bio.SeqFeature.Reference ...>`, `<Bio.SeqFeature.Reference ...>`]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
Lets look a little more closely… and use Python’s dir()
function
to find out more about the SeqRecord
object and what it does:
>>> dir(record)
[..., 'annotations', 'dbxrefs', 'description', 'features', 'format', 'id', 'letter_annotations', 'name', 'seq']
If you didn’t already know, the dir()
function returns a list of all
the methods and properties of an object (as strings). Those starting
with underscores in their name are “special” and we’ll be ignoring them
in this discussion. We’ll start with the .seq
property:
>>> record.seq
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
>>> type(record.seq)
<class 'Bio.Seq.Seq'>
This is a Seq
object, another important object type in
Biopython, and worth of its own page on the wiki documentation.
The following three properties are all simple strings:
>>> print(record.id)
Z78439.1
>>> print(record.name)
Z78439
>>> print(record.description)
P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Have a look at the raw GenBank file to see where these came from.
Next, we’ll check the .dxrefs
property, which holds any database
cross references:
>>> print(record.dbxrefs)
[]
>>> type(record.dbxrefs)
<type 'list'>
An empty list? Disappointing… if we’d used a more recent GenBank file the genome sequencing project reference would show up here.
How about the .annotations
property? This is a Python dictionary…
>>> print(record.annotations)
{'sequence_version: 1, 'source': 'Paphiopedilum barbatum', 'taxonomy': ...}
>>> tpye(record.annotations)
<type 'dict'>
>>> print(record.annotations["source"])
Paphiopedilum barbatum
In this case, most of the values in the dictionary are simple strings,
but this isn’t always the case - have a look at the references entry for
this example - its a list of Reference
objects:
>>> type(record.annotations["references"])
<type 'list'>
>>> print(len(record.annotations["references"]))
2
>>> for ref in record.annotations["references"]:
print(ref.authors)
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
Cox,A.V.
Next is .features
which is another list property, and it contains
SeqFeature
objects:
>>> type(record.features)
<type 'list'>
>>> print(len(record.features))
5
SeqFeature
objects are complicated enough to warrant their own wiki
page… for now please refer to the Tutorial.
If you are using Biopython 1.48 or later, there will be a .format()
method. This lets you convert the SeqRecord
into a string using one
of the output formats supported by Bio.SeqIO
, for
example:
>>> print(record.format("fasta"))
This should give:
>Z78439.1 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC
ACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTT
TCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACA
TAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAAC
TCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATG
GTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCA
TCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTG
TCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGG
GGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTA
TGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC
If you are using Biopython 1.50 or later, there will also be a
.letter_annotations
property. Again this is a dictionary but for
per-letter-annotation such as sequence quality scores or secondary
structure predictions. This kind of information isn’t found in GenBank
files, so in this case the dictionary is empty:
>>> print(record.letter_annotations)
{}
Have a look at FASTQ or QUAL files to see how quality scores are represented. Stockholm (PFAM) alignment files also often include per-letter-annotation.
Most of the time you’ll create SeqRecord
objects by parsing a
sequence file with Bio.SeqIO
. However, it is useful
to know how to create a SeqRecord
directly. For example,
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
IUPAC.protein),
id="YP_025292.1", name="HokC",
description="toxic membrane protein, small")
print(record)
This would give the following output:
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
You could then pass this new record to
Bio.SeqIO.write(...)
to save it to disk.