The XML output of NCBI’s stand alone BLAST programs does not include
information on query sequences that have ‘no hits’ in the target
database. Sometimes
you want to know which sequences don’t have match a database and further
analyse/anotate them accordingly. There are a number of different ways
to do this, one is to use SeqIO’s method .index()
to
turn the query file into a dictionary then parse the results file to get
the sequences that did match the dictionary. You can then use Python’s
set()
arithmetic to make a list of sequences that are in the query file
and not the results which can be used as keys to retrieve the complete
SeqRecord
for each of the “no hit” queries. Got
it? Well, perhaps it’s easier to just do it:
Let’s presume you set up a BLAST run with the sequences in a file called
queries.fasta
searched against a database, with the results saved to
BLAST_RESULTS.xml
from Bio import SeqIO
from Bio.Blast import NCBIXML
# Build an index, but we don't need to parse the record
q_dict = SeqIO.index("queries.fasta", "fasta")
hits = []
for record in NCBIXML.parse(open("BLAST_RESULTS.xml")):
# As of BLAST 2.2.19 the xml output for multiple-query blast searches
# skips queries with no hits so we could just append the ID of every blast
# record to our 'hit list'. It's possible that the NCBI will change this
# behaviour in the future so let's make the appending conditional on there
# being some hits (ie, a non-empty alignments list) recorded in the blast
# record
if record.alignments:
# The blast record's 'query' contains the sequences description as a
# string. We used the ID as the key in our dictionary so we'll need to
# split the string and get the first field to remove the right entries
hits.append(record.query.split()[0])
misses = set(q_dict.keys()) - set(hits)
orphan_records = [q_dict[name] for name in misses]
We can do a little sanity check to make sure everything worked OK:
>>> print("found %i records in query, %i have hits, making %i misses"
% (len(q_dict), len(hits), len(misses)))
found 11955 records in query, 2802 have hits, making 9153 misses
Good, now you have all the ‘not hits’ sequence in a list
(orphan_records
) of SeqRecord
objects you can annotate/analyse as you
please or use SeqIO.write()
to make a new file of
just these sequences that can be put through another program.
As implemented above most of the time in each run is spend populating the list of hits from the BLAST parser, would checking each record from the results file against the dictionary one at a time be a less memory intensive way to go in case of very large files?