Tuesday, December 1, 2020

Working with Biological data: part 1

A FASTA file is a text-based file used to store nucleotide or protein sequences. Inside the file, each sequence begins with a single-line description that starts with a greater-than (‘>’) symbol. The description is followed by one or more lines of sequence data.

Example:


Reading a FASTA file using Python


In order to do this, we need a Python dictionary to store our sequence data. A Python dictionary is a collection of key-value pairs, where each value is accessed by referring to its key. (Since Python 3.7, a dictionary also keeps its items in insertion order.)

Example of a Python dictionary:

my_dictionary = {}
my_dictionary['RestrictionSite'] = 'GAATTC'
print(my_dictionary['RestrictionSite'])

Output:

GAATTC

If you want to know how to read a simple text file in Python 3, see the post "Python: Reading File".

For a FASTA file, we use the description of each sequence as the key and the sequence itself as the value.
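
As a rough illustration of that layout, the sketch below builds such a dictionary by hand. The two descriptions and sequences are made up for the example and do not come from a real file.

```python
# Hypothetical records: description -> sequence
sequences = {
    'seq1 example restriction site': 'GAATTC',
    'seq2 example peptide': 'MKV',
}

# Look up one sequence by its description
print(sequences['seq1 example restriction site'])

# Walk over every description/sequence pair
for description, sequence in sequences.items():
    print(description, len(sequence))
```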

We know that every description starts with a ‘>’ symbol. Thus, by checking whether a line starts with ‘>’, we can easily tell description lines from sequence lines.
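
A small sketch of that check, using str.startswith() on a few made-up lines (the sample content is an assumption for illustration only):

```python
# Made-up FASTA content, already split into lines
lines = ['>seq1 example', 'GAATTC', 'GGATCC', '>seq2 example', 'ATGC']

kinds = []
for line in lines:
    if line.startswith('>'):
        kinds.append('description')   # header line of a record
    else:
        kinds.append('sequence')      # part of the current record's sequence

print(kinds)
```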

When a line starts with ‘>’, the rest of that line (without the ‘>’) becomes a key in the dictionary. Every sequence line that follows is appended to that key's value until a new description line is found.

While processing, the hidden newline character ‘\n’ at the end of each line is removed. This is done simply by slicing off the last character of the line.

Example:

text = "AMINO ACIDS"
first_character_removed = text[1:]
last_character_removed = text[:-1]
print(first_character_removed)
print(last_character_removed)

Output:

MINO ACIDS
AMINO ACID
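
Slicing removes the last character whatever it is, so it is only safe when the line really ends with a newline. As a hedged alternative (not the approach used in this post), the built-in str.rstrip('\n') strips the newline only when it is there:

```python
with_newline = "GAATTC\n"
without_newline = "GAATTC"

# rstrip('\n') removes trailing newlines only if they are present
stripped_a = with_newline.rstrip('\n')
stripped_b = without_newline.rstrip('\n')
print(repr(stripped_a))
print(repr(stripped_b))

# Slicing always drops the last character, newline or not
sliced = without_newline[:-1]
print(repr(sliced))
```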

Combining all of this logic, we can read a FASTA file into a dictionary using the following code:

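inputfilename = 'test.fasta'
output_dictionary = {}
with open(inputfilename, 'r') as file:
    current_header = ''
    for line in file:
        # Crop the trailing newline, if the line has one
        if line.endswith('\n'):
            line = line[:-1]
        if line.startswith('>'):
            # Description line: drop the leading '>' and start a new entry
            current_header = line[1:]
            output_dictionary[current_header] = ''
        elif current_header != '':
            # Sequence line: append it to the current entry
            output_dictionary[current_header] += line

Note that the line

output_dictionary[current_header] += line

must sit inside the loop, at the same level as the other statements that handle a sequence line. Because every sequence line is appended as soon as it is read, the last sequence is already complete when the loop ends. Repeating this line after the loop is not needed and would append the final line a second time.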
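
To try the reader end to end, here is a minimal sketch that wraps the same logic in a function and runs it on a small temporary file whose last line has no trailing newline. The function name read_fasta and the sample records are assumptions made for this example.

```python
import os
import tempfile


def read_fasta(path):
    """Return a dict mapping each description (without '>') to its sequence."""
    records = {}
    current_header = ''
    with open(path, 'r') as handle:
        for line in handle:
            if line.endswith('\n'):
                line = line[:-1]              # crop the newline if present
            if line.startswith('>'):
                current_header = line[1:]     # start a new record
                records[current_header] = ''
            elif current_header != '':
                records[current_header] += line
    return records


# Write a small made-up FASTA file; the last line has no trailing newline
with tempfile.NamedTemporaryFile('w', suffix='.fasta', delete=False) as tmp:
    tmp.write('>seq1 example\nGAATTC\nGGATCC\n>seq2 example\nATGC')
    tmp_path = tmp.name

records = read_fasta(tmp_path)
os.remove(tmp_path)
print(records)
```

The multi-line sequence of the first record is joined into one string, and the last record is read in full without any extra step after the loop.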

Here is the link to my GitHub hosted file.

Saturday, April 25, 2020

Python: Reading File


 

Python is a simple programming language to use. With Python, we can read a text file using the built-in open() function. For our purposes, open() takes two arguments: the first specifies the file name, and the second specifies the mode in which to open the file. The available modes are:

Character   Meaning
'r'         open for reading (default)
'w'         open for writing, truncating the file first
'x'         open for exclusive creation, failing if the file already exists
'a'         open for writing, appending to the end of the file if it exists
'b'         binary mode
't'         text mode (default)
'+'         open for updating (reading and writing)
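
A short sketch of a few of these modes in action. It works in a temporary folder, and the file name notes.txt is made up for the example.

```python
import os
import tempfile

# Work in a temporary directory so no real file is touched
folder = tempfile.mkdtemp()
path = os.path.join(folder, 'notes.txt')

with open(path, 'w') as f:      # 'w' creates the file (or empties an existing one)
    f.write('first line\n')

with open(path, 'a') as f:      # 'a' adds to the end of the file
    f.write('second line\n')

with open(path, 'r') as f:      # 'r' reads; text mode 't' is implied
    content = f.read()

with open(path, 'rb') as f:     # adding 'b' returns bytes instead of str
    raw = f.read()

print(content)
print(type(raw))

# Clean up the temporary file and folder
os.remove(path)
os.rmdir(folder)
```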

Let's read a simple text file using the following few lines of code:

file = open("location/to/file.txt",'r')
for line in file:
    print(line)
file.close() 

In this code, the variable 'file' holds the file object returned by open(). A for loop reads the file line by line, with the variable 'line' representing the current line. Inside the loop, each line is printed using the built-in print() function. Once reading is complete, the file is closed using its close() method.
This can also be done in the following way:

with open("location/to/file.txt",'r') as file:
    for line in file:
        print(line)

The advantage of the above code is that it automatically closes the file once processing is done, even if an error occurs inside the block.
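
One way to see this is to check the file object's closed attribute inside and after the with block. The sketch below creates a throwaway file first, so the path and its contents are assumptions made for the example.

```python
import os
import tempfile

# Create a small throwaway file to read
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('line one\nline two\n')
    path = tmp.name

with open(path, 'r') as file:
    inside = file.closed          # False: the file is still open here
    for line in file:
        print(line, end='')       # each line already ends with '\n'

after = file.closed               # True: closed on leaving the with block
print(inside, after)

os.remove(path)
```

Passing end='' to print() avoids the blank line that would otherwise appear after each printed line, since every line read from the file already carries its own newline.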



