This page is aimed at developers interested in understanding or
extending SeqIO, the new Sequence Input/Output interface for Biopython.
Note: The details are still subject to change.
To add support for reading a new file format, you must implement an
iterator that expects just a file handle and returns SeqRecord objects.
You may do this using:
- an iterator class (see Bio.SeqIO.Interfaces)
- a generator function (using the yield keyword; suitable for simple formats)
- a function that builds a list of SeqRecords and then turns it into an iterator using the iter() function
You may accept additional optional arguments (an alphabet, for example). However, there must be one and only one required argument (the input file handle).
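As a rough illustration of the generator-function approach, the sketch below parses a made-up format with one `identifier|sequence` pair per line. The format, the function name `pipe_iterator` and the fallback `SeqRecord` stand-in are assumptions for this example only; the stand-in exists just so the sketch also runs where Biopython is not installed.

```python
from io import StringIO

try:
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord
except ImportError:
    # Minimal stand-ins (not Biopython code) so the sketch runs without Biopython.
    Seq = str

    class SeqRecord:
        def __init__(self, seq, id="<unknown id>", name="<unknown name>",
                     description="<unknown description>"):
            self.seq = seq
            self.id = id
            self.name = name
            self.description = description


def pipe_iterator(handle, alphabet=None):
    """Yield one SeqRecord per 'identifier|sequence' line (hypothetical format).

    The handle is the only required argument. The optional alphabet
    argument is accepted but ignored in this sketch.
    """
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        try:
            identifier, sequence = line.split("|")
        except ValueError:
            # Too few or too many fields on the line
            raise ValueError("Expected 'identifier|sequence', found %r" % line) from None
        yield SeqRecord(Seq(sequence), id=identifier, name=identifier, description="")


handle = StringIO("alpha|ACGT\nbeta|GGCC\n")
records = list(pipe_iterator(handle))
```

The same function could be wrapped as `iter(list(pipe_iterator(handle)))` if building the full list first is more convenient for a given format.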
What you use as the SeqRecord’s id, name and description will depend on
the file format. Ideally you would use the accession number for the id.
This id should also be unique for each record (unless the records in the
file are in themselves ambiguous).
When storing any annotations in the record’s annotations dictionary, follow the de facto standard laid down by the GenBank parser… I should try and document this more.
If the supplied file seems to be invalid, raise a ValueError exception.
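A minimal sketch of this kind of check, assuming a hypothetical format whose first line must be the header `#HYPO 1.0` (both the header and the helper name `data_lines` are invented here). A real parser would go on to build SeqRecord objects from the lines this helper yields.

```python
from io import StringIO

MAGIC = "#HYPO 1.0"  # hypothetical header line, invented for this sketch


def data_lines(handle):
    """Yield the non-empty lines after the header.

    Raises ValueError if the first line is not the expected header.
    """
    first = handle.readline().rstrip("\n")
    if first != MAGIC:
        raise ValueError(
            "Not a valid file: expected %r on the first line, found %r" % (MAGIC, first)
        )
    for line in handle:
        line = line.strip()
        if line:
            yield line


good = list(data_lines(StringIO("#HYPO 1.0\nalpha|ACGT\n")))

try:
    list(data_lines(StringIO("garbage\n")))
    error = None
except ValueError as exc:
    error = str(exc)
```

Because `data_lines` is a generator function, the ValueError is raised when iteration starts, not when the function is called. If the error should surface immediately, the header check can be moved into an ordinary function that then returns the generator.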
Finally, the new format must be added to the relevant dictionary mapping
in Bio/SeqIO/__init__.py so that the Bio.SeqIO.parse() and
Bio.SeqIO.read() functions are aware of it.
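The idea of a dictionary mapping format names to iterators can be sketched as follows. This is an illustration of the dispatch pattern only, not Biopython's actual code: the registry name `FORMAT_TO_ITERATOR`, the format name `hypo-lines` and the stand-in parser are all assumptions made for this example.

```python
from io import StringIO


def line_iterator(handle):
    """Stand-in parser: yields stripped non-empty lines instead of SeqRecords."""
    for line in handle:
        line = line.strip()
        if line:
            yield line


# Hypothetical registry mapping lower-case format names to iterator functions.
FORMAT_TO_ITERATOR = {
    "hypo-lines": line_iterator,
}


def parse(handle, format):
    """Look up the iterator for the named format and apply it to the handle."""
    try:
        iterator_function = FORMAT_TO_ITERATOR[format]
    except KeyError:
        raise ValueError("Unknown format %r" % format) from None
    return iterator_function(handle)


def read(handle, format):
    """Return the single record in the handle; raise ValueError otherwise."""
    iterator = parse(handle, format)
    try:
        record = next(iterator)
    except StopIteration:
        raise ValueError("No records found in handle") from None
    try:
        next(iterator)
    except StopIteration:
        return record
    raise ValueError("More than one record found in handle")


all_records = list(parse(StringIO("a\nb\n"), "hypo-lines"))
single = read(StringIO("a\n"), "hypo-lines")
```

With this layout, supporting a new format amounts to writing the iterator and adding one entry to the dictionary; the `parse` and `read` functions themselves do not change.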
Note: The details are still subject to change.
To add support for writing a new file format, you should write a
subclass of one of the writer objects in Bio.SeqIO.Interfaces.
Then, the new format must be added to the relevant dictionary mappings
in Bio/SeqIO/__init__.py so that the Bio.SeqIO.write() function is aware
of your code.
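The subclassing pattern might look roughly like the sketch below. The base class `HypotheticalWriter` is a stand-in written for this example and is not one of the classes in Bio.SeqIO.Interfaces, whose names and methods should be checked in the Biopython version you are targeting. The `identifier|sequence` output format and the simple record objects are likewise assumptions.

```python
from io import StringIO
from types import SimpleNamespace


class HypotheticalWriter:
    """Stand-in base class for this sketch (not Biopython's interface)."""

    def __init__(self, handle):
        self.handle = handle

    def write_record(self, record):
        raise NotImplementedError("Subclasses must implement write_record")

    def write_file(self, records):
        """Write all records and return how many were written."""
        count = 0
        for record in records:
            self.write_record(record)
            count += 1
        return count


class PipeWriter(HypotheticalWriter):
    """Writes one 'identifier|sequence' line per record (made-up format)."""

    def write_record(self, record):
        if "|" in record.id or "\n" in record.id:
            # The identifier would corrupt the output, so refuse to write it.
            raise ValueError("Identifier cannot be written in this format: %r" % record.id)
        self.handle.write("%s|%s\n" % (record.id, record.seq))


# Simple objects with id and seq attributes stand in for SeqRecord objects.
records = [
    SimpleNamespace(id="alpha", seq="ACGT"),
    SimpleNamespace(id="beta", seq="GGCC"),
]
handle = StringIO()
count = PipeWriter(handle).write_file(records)
output = handle.getvalue()
```

Keeping the per-record logic in `write_record` means the subclass only has to describe how a single record is formatted, while the base class handles iteration and counting.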
If the supplied records cannot be written to this file format, raise a
ValueError exception. Where appropriate, please use the following wording:
raise ValueError("Must have at least one sequence")
raise ValueError("Sequences must all be the same length")
raise ValueError("Duplicate record identifier: %s" % ...)
...
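As one possible way of applying this wording, the sketch below checks a set of records before they are written to an alignment-style format. The function name `check_alignment_records` and the simple record objects are assumptions for illustration; only the three error messages come from the list above.

```python
from types import SimpleNamespace


def check_alignment_records(records):
    """Return the records as a list, or raise ValueError with the standard wording."""
    records = list(records)
    if not records:
        raise ValueError("Must have at least one sequence")
    expected_length = len(records[0].seq)
    seen_ids = set()
    for record in records:
        if len(record.seq) != expected_length:
            raise ValueError("Sequences must all be the same length")
        if record.id in seen_ids:
            raise ValueError("Duplicate record identifier: %s" % record.id)
        seen_ids.add(record.id)
    return records


def message_for(records):
    """Helper for the demonstration: return the error message, or None if valid."""
    try:
        check_alignment_records(records)
    except ValueError as exc:
        return str(exc)
    return None


# Simple objects with id and seq attributes stand in for SeqRecord objects.
a = SimpleNamespace(id="alpha", seq="ACGT")
b = SimpleNamespace(id="beta", seq="GGCC")
short = SimpleNamespace(id="gamma", seq="AC")
duplicate = SimpleNamespace(id="alpha", seq="TTTT")

no_error = message_for([a, b])
empty_error = message_for([])
length_error = message_for([a, short])
duplicate_error = message_for([a, duplicate])
```

Which checks apply depends on the format: the equal-length requirement, for example, only makes sense for formats that store alignments.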
ToDo - Define standard exceptions in Bio.SeqIO itself?
There are existing parsers in Biopython for the following file formats,
which could be integrated into SeqIO, AlignIO or SearchIO if
appropriate.
Can Bio.KEGG parse files in KEGG format?
Bio.MEME has a parser for the MEME file format, which at first glance
looks like it could be treated like an alignment format.
Pairwise alignments from the BLAST suite could be turned into a pairwise
Alignment object with Bio.AlignIO. Is this useful? Sample code on Bug
2560.