Tuesday, December 1, 2020

Working with Biological data: part 1

Fasta file is a text-based file used to store nucleotide or protein sequences.Inside the file, each sequence begins with a single line description. The description line starts with a greater than (‘>‘) symbol. Description is followed by lines containing sequence data.

Example:


Reading fasta file using python


In order to do this, we need a python dictionary to store our sequence data. A python dictionary is an unordered collection of items, where each item is accessed by referring to its key name.

Example of python dictionary:

my_dictionary = {}
my_dictionary['RestrictionSite'] = 'GAATTC'
print(my_dictionary['RestrictionSite'])

Output:
If you want to know how to read a simple text file in python 3 please click here.

For fasta file, we are going to specify description of sequence as key and sequence as item.

We know that every description start with a ‘>’ symbol. Thus, by checking if a line starts with ‘>’ symbol, we can easily identify description line and sequence line.

When a line is found to be start with ‘>’ symbol, the whole line is considered as a key inside the dictionary. Until a new description line is found, all those sequence line followed by current description line is appended to its value.

While processing, the hidden next line character ‘\n’ in the end of each line is removed. This is accomplished simply by cropping the end character of each line.

Example:

text = "AMINO ACIDS"
first_character_removed = text[1:]
last_character_removed = text[:-1]
print(first_character_removed)
print(last_character_removed)

output:

By combining all these logic, we can read a fasta file into a dictionary using following code:

inputfilename = 'test.fasta'                                   
output_dictionary = {}
with open(inputfilename, 'r') as file:                         
    current_header = ''                                        
    for line in file:                                          
        if line.startswith('>'):                               
            current_header = line[1:][:-1]                     
            output_dictionary[current_header] = ""             
        else:
            if not(current_header == ''):                      
                if line.endswith('\n'):line = line[:-1] 
output_dictionary[current_header] += line

In order to avoid the loss of last sequence when the loop ends, an extra line of code is required outside the loop,

output_dictionary[current_header] += line 

make sure we never miss our last sequence.

Here is the link to my GitHub hosted file.

No comments:

Post a Comment

Working with Biological data: part 1

Fasta file is a text-based file used to store nucleotide or protein sequences.Inside the file, each sequence begins with a single line d...