Fasta file is a text-based file used to store nucleotide or protein
sequences.Inside the file, each sequence begins with a single line
description. The description line starts with a greater than (‘>‘) symbol. Description is followed by lines containing sequence data.
Example:
Reading fasta file using python
In order to do this, we need a python dictionary to store our
sequence data. A python dictionary is an unordered collection of items,
where each item is accessed by referring to its key name.
Example of python dictionary:
my_dictionary = {}
my_dictionary['RestrictionSite'] = 'GAATTC'
print(my_dictionary['RestrictionSite'])
Output:
If you want to know how to read a simple text file in python 3 please click here.
For fasta file, we are going to specify description of sequence as key and sequence as item.
We know that every description start with a ‘>’ symbol. Thus, by
checking if a line starts with ‘>’ symbol, we can easily identify
description line and sequence line.
When a line is found to be start with ‘>’ symbol, the whole line
is considered as a key inside the dictionary. Until a new description
line is found, all those sequence line followed by current description
line is appended to its value.
While processing, the hidden next line character ‘\n’ in the end of
each line is removed. This is accomplished simply by cropping the end
character of each line.
text = "AMINO ACIDS"
first_character_removed = text[1:]
last_character_removed = text[:-1]
print(first_character_removed)
print(last_character_removed)
output:
By combining all these logic, we can read a fasta file into a dictionary using following code:
inputfilename = 'test.fasta'
output_dictionary = {}
with open(inputfilename, 'r') as file:
current_header = ''
for line in file:
if line.startswith('>'):
current_header = line[1:][:-1]
output_dictionary[current_header] = ""
else:
if not(current_header == ''):
if line.endswith('\n'):line = line[:-1]
output_dictionary[current_header] += line
In order to avoid the loss of last sequence when the loop ends, an extra line of code is required outside the loop,
output_dictionary[current_header] += line
make sure we never miss our last sequence.
Here is the link to my GitHub hosted file.