Saturday, November 19, 2016

Consensus and Profile (ROSALIND CONS)

numpy helps here: the profile matrix is just a 4 x n array of per-position nucleotide counts.
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)
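
To make the task concrete, here is a minimal pure-Python sketch on a made-up three-string collection (not the Rosalind dataset). The function name `profile_and_consensus` and the toy strings are assumptions for illustration only. It counts each nucleotide per column and picks the most frequent one, breaking ties in A, C, G, T order.

```python
from collections import Counter

def profile_and_consensus(dna_strings):
    """Return (profile, consensus) for equal-length DNA strings.

    profile maps each nucleotide to its list of per-position counts.
    """
    # zip(*...) turns the rows (strings) into columns (positions).
    counts = [Counter(column) for column in zip(*dna_strings)]
    profile = {base: [c[base] for c in counts] for base in "ACGT"}
    # max() returns the first maximal item, so ties resolve as A, C, G, T.
    consensus = "".join(max("ACGT", key=lambda b: c[b]) for c in counts)
    return profile, consensus

# Toy input, chosen only to illustrate the definitions.
profile, consensus = profile_and_consensus(["ATGC", "ATCC", "GTGC"])
print(consensus)
for base in "ACGT":
    print(base + ": " + " ".join(str(n) for n in profile[base]))
```

Each row of the profile answers "how many strings have this nucleotide at this position", and the consensus reads off the winner of each column.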

import numpy as np
import re

# Read the whole FASTA file; the with-block closes it afterwards.
with open('11.txt', 'r') as f:
    data = f.read()

# Each match is (header, sequence block). The block may span several lines
# and does not need a trailing newline at the end of the file.
strings = re.findall(r'(>Rosalind_[0-9]+)\n([ACGT\n]+)', data)
str_len = len(strings[0][1].replace('\n', ''))
nucleotides = ['A', 'C', 'G', 'T']

# profile[i, j] counts how often nucleotides[i] occurs at position j.
profile = np.zeros((4, str_len), dtype=int)
for s in strings:
    st = s[1].replace('\n', '')
    for position, nucleotide in enumerate(st):
        profile[nucleotides.index(nucleotide), position] += 1

# For each column take the first nucleotide with the maximal count;
# ties are broken in A, C, G, T order.
consensus = ''
for position in range(str_len):
    column = profile[:, position]
    consensus += nucleotides[column.argmax()]
print(consensus)

# Print the profile matrix, one row per nucleotide.
for i, nucleotide in enumerate(nucleotides):
    print(nucleotide + ': ' + ' '.join(str(count) for count in profile[i]))
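
The counting loops above can also be written as array operations. The following is a sketch of that alternative, not the solution as originally posted. The helper names `profile_matrix` and `consensus_string` are made up for this example, and the input is assumed to be already-parsed strings of equal length.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def profile_matrix(dna_strings):
    """Return a 4 x n integer array of counts, rows ordered A, C, G, T."""
    # Character array of shape (number of strings, string length).
    chars = np.array([list(s) for s in dna_strings])
    # For each nucleotide, count the matches down every column.
    return np.array([(chars == n).sum(axis=0) for n in NUCLEOTIDES])

def consensus_string(profile):
    """Pick the most frequent nucleotide in every column."""
    # argmax returns the first maximal index, so ties resolve as A, C, G, T.
    return "".join(NUCLEOTIDES[i] for i in profile.argmax(axis=0))

# Toy input, not the Rosalind dataset.
profile = profile_matrix(["ATGC", "ATCC", "GTGC"])
print(consensus_string(profile))
for name, row in zip(NUCLEOTIDES, profile):
    print(name + ": " + " ".join(str(count) for count in row))
```

Comparing the whole character array against one nucleotide at a time replaces the per-character if/elif chain, and `argmax` along axis 0 replaces the inner search for the column maximum.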
