среда, 28 декабря 2016 г.

Finding a Shared Motif (ROSALIND LCSM)

Given: A collection of k (k100   ) DNA strings of length at most 1 kbp each in FASTA format. Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

import re

def main():
 print MainFunc()

def MainFunc():
 f = open('17.txt', 'r')
 string = ''
 strings = []
 for line in f:
  if line[0] == '>':
   if string != '':
    string = string.replace('\n', '')
    strings.append(string)
    string = ''
  else:
   string = string + line
 if string != '':
    string = string.replace('\n', '')
    strings.append(string)
 global s1, s2
 s1 = strings.pop()
 s2 = strings.pop()
 m = []
 m.append(['A', 'C', 'G', 'T'])
 pre_motifs = pre_motif(m)
 len_pre_motifs = len(pre_motifs)
 i = len_pre_motifs - 1
 suit = 1
 while i >= 0:
  for motif in pre_motifs[i]:
   for s in strings:
    if not re.search(motif, s):
     suit = 0
     break
   if suit == 1:
    return motif
  i -= 1
 return 'Found nothing.'

def pre_motif(m):
 letters = ['A', 'C', 'G', 'T']
 lenm = len(m)
 new_m_arr = []
 for head in m[lenm - 1]:
  for letter in letters:
   new_m = head + letter
   match_s1 = re.search(new_m, s1)
   if match_s1:
    match_s2 = re.search(new_m, s2)
    if match_s2:
     new_m_arr.append(new_m)
 if len(new_m_arr) > 0:
  m.append(new_m_arr)
  return pre_motif(m)
 else:
  return m


if __name__ == "__main__":
 main()

понедельник, 19 декабря 2016 г.

Интересная задача на сочетания (решетка)

Пирамидка, в которую играет ребенок, состоит из  дисков, по  каждого размера (диски одного размера неразличимы). Пирамидка устроена таким образом, что на нижнем слое может находиться только самый большой диск, а на верхнем слое только самый маленький. Мы будем называть правильной сборкой пирамидки такую последовательность из семи дисков, что каждый следующий не больше предыдущего, первый диск — самого большого размера, а последний — самого маленького (например,   и 7 — правильные сборки, а нет). Сколько всего существует правильных сборок? Постарайтесь проинтерпретировать данную задачу в терминах путей на решетке.


суббота, 17 декабря 2016 г.

Задачка на множества

Введите первые девять цифр числа перестановок всех букв английского алфавита, которые не содержат в себе подстрок
Используйте принцип включений-исключений и рассуждения, похожие на те, которые мы проводили, когда получали общее количество перестановок.





среда, 14 декабря 2016 г.

Genome Assembly as Shortest Superstring (ROSALIND LONG)

Enother interesting task!
What I do: suppose there is such superstring S. Let the first read coordinate on S be 0. In MainFunc I find all right tails for first read and calculate their coordinates on S. Using recursion, I find everething that goes after first read to the right.
Then I reverse all that rests and do the same. Then I reverse it back and get the "left tail".

Given: At most 50 DNA strings whose length does not exceed 1 kbp in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).

The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.

Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).

def find_middle(len1):
 if len1 % 2 == 0:
  return len1/2 + 1
 else:
  return (len1 + 1)/2

def find_max_middle(len1, len2):
 mlen1 = find_middle(len1)
 mlen2 = find_middle(len2)
 if mlen1 > mlen2:
  return mlen1
 else:
  return mlen2

def MainFunc(find_what_reads, find_where_reads, find_what_positions, res_string):
 find_what_reads1 = []
 find_what_positions1 = []
 find_what_counter = 0
 for find_what in find_what_reads:
  find_where_counter = 0
  for find_where in find_where_reads:
   find_what_len = len(find_what)
   find_where_len = len(find_where)
   max_middle = find_max_middle(find_what_len, find_where_len)
   a = find_what_len - max_middle
   find_what_part = find_what[a:]
   coincidence = find_where.find(find_what_part, 0)
   if coincidence > 0:
    find_where_res_string_position = find_what_positions[find_what_counter] + a - coincidence
    d = find_where_len - len(res_string[find_where_res_string_position:])
    if d > 0:
     res_string = res_string + find_where[-d:]  
    find_where_reads[find_where_counter] =  ''   
    find_what_reads1.append(find_where)
    find_what_positions1.append(find_where_res_string_position)
   find_where_counter += 1
  find_what_counter += 1
 if len(find_what_reads1) > 0:
  res_string = MainFunc(find_what_reads1, find_where_reads, find_what_positions1, res_string)['res_string']
  find_where_reads = MainFunc(find_what_reads1, find_where_reads, find_what_positions1, res_string)['find_where_reads']
 res = {}
 res['res_string'] = res_string
 res['find_where_reads'] = find_where_reads
 return res

def main():
 reads = []
 f = open('16.txt', 'r')
 read = ''
 for line in f:
  if line[0] != '>':
   read = read + line
  else:
   if read != '':
    read_clean = read.replace('\n', '')
    reads.append(read_clean)
   read = ''
 read_clean = read.replace('\n', '')
 reads.append(read_clean)
 res_string = reads[0]
 n = len(res_string)
 res_string_right2left = reads[0]
 find_what_positions = [0]
 find_what_reads = [reads[0]]
 res_left2right = MainFunc(find_what_reads, reads, find_what_positions, res_string)
 string_left2right = res_left2right['res_string']
 find_where_reads_left2right = res_left2right['find_where_reads']
 find_where_reads_right2left = []
 for r in find_where_reads_left2right:
  find_where_reads_right2left.append(r[::-1])
 res_right2left = MainFunc([res_string_right2left[::-1]], find_where_reads_right2left, find_what_positions, res_string_right2left[::-1])
 string_right2left = res_right2left['res_string']
 print string_right2left[n:][::-1] + string_left2right

if __name__ == '__main__':
 main()

суббота, 10 декабря 2016 г.

Speeding Up Motif Finding (ROSALIND KMP)

Given: A DNA string s (of length at most 100 kbp) in FASTA format.
Return: The failure array of s.

f = open('15.txt','r')
pre_s = ''
for line in f:
 pre_s = pre_s + line
f.close()
s = pre_s.replace('\n', '')
len_s = len(s)
first = s[0]
i = 1
P = [0]
matches = [] # list with matches lengths
while i < len_s:
 max_match = 0 # max match length for current string symbol
 j = 0
 len_matches = len(matches)
 while j < len_matches:
  match_num = matches[j]
  if s[i] == s[match_num]:
   match_num_pl1 = match_num + 1
   matches[j] = match_num_pl1
   if match_num_pl1 > max_match:
    max_match = match_num_pl1
  else:
   matches.pop(j)
   j -= 1
   len_matches -= 1
  j += 1
 if s[i] == first:
  matches.append(1)
  if max_match < 1:
   max_match = 1
 i += 1
 P.append(max_match)
res = ''
for k in P:
 res = res + ' ' + str(k)
f1 = open('15_1.txt', 'w')
f1.write(res)
f1.close()
 

Independent Alleles (ROSALIND LIA)

The most interesting problem by now!

Solution:


Code:

from __future__ import division
import sys
import math
from scipy import special

def main():
 if len(sys.argv) > 2:
  k = int(sys.argv[1])
  N = int(sys.argv[2])
  descend_num = pow(2, k)
  i = N
  prob = 0
  while i <= descend_num:
   p = descend_num - i
   prob = prob + special.binom(descend_num, i)*pow(1/4, i)*pow(3/4, p)
   i += 1
  print round(prob, 3)
 else:
  print 'Enter data!'

if __name__ == '__main__':
 main()

среда, 30 ноября 2016 г.

Introduction to Random Strings (ROSALIND PROB)

If I want to find the propobility that string s and some random string are equal, it'll 0.25 to the power of s length. Here I have not a random string, and it's not 0.25, but some other number, which is easy to find from gc-content.
Hint in this problem is very useful.

Given: A DNA string s of length at most 100 bp and an array A
containing at most 20 numbers between 0 and 1.
Return: An array B
having the same length as A in which B[k] represents the common logarithm of the probability that a random string constructed with the GC-content found in A[k] will match s exactly.
Hint: One property of the logarithm function is that for any positive numbers x and y, log10(xy)=log10(x)+log10(y).


from __future__ import division
import re
import sys
import math

def atcg_prob(x):
 cg_prob = float(x)
 at_prob = 1 - cg_prob
 atcg_prob = {}
 atcg_prob['A'] = at_prob / 2
 atcg_prob['T'] = atcg_prob['A']
 atcg_prob['C'] = cg_prob / 2
 atcg_prob['G'] = atcg_prob['C']
 return atcg_prob

def main():
 if len(sys.argv) > 1:
  res = ''
  i = 2
  while i < len(sys.argv):
   cont = atcg_prob(sys.argv[i])
   prob = 0
   for nuk in sys.argv[1]:
    prob = prob + math.log(cont[nuk], 10)
   #res = res + str(math.log(prob, 10)) + ' '
   res = res + str(round(prob, 3)) + ' ' 
   i += 1
  print res
 else:
  print 'Enter datas!'

if __name__ == '__main__':
 main()

вторник, 29 ноября 2016 г.

Хожу на химфак, слушаю лекции.

Хожу на химический факультет, слушаю органику. Многое понимаю, но сделать ничего не могу. Хотя было бы странно, если бы что-то могла, позанимавшись химией всего 60 часов. На занятиях очень интересно, прям как в кино. Тут пример того, что рассказывают.
Как получить фенолфталеин из фенола и фталевого ангидрида.
Вообще не очень понятно, почему атакуется только верхний кислород, а нижний нет. Но такая уж она химия - волшебная наука.

А тут показано почему фенолфталеин меняет окраску в щелочной среде. Как видно, от щелочи появляется много двойных связей по всей молекуле. Они то и меняет оптические свойства, так что щелочной среде он становится малиновым. Кстати, в кислой он розовый. 



вторник, 22 ноября 2016 г.

Overlap Graphs (ROSALIND GRPH)

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O3. You may return edges in any order.

import re

f = open('12.txt', 'r')
reads = re.findall(r'(Rosalind_[0-9]+)\n(([A-T]+\n)+)', f.read())
suffix = {}
prefix = {}
for s in reads:
 string = s[1].replace('\n', '')
 if len(string) > 3:
  head = string[:3]
  if head in prefix:
   prefix[head].append(s[0])
  else:
   prefix[head] = [s[0]]
  tail = string[-3:]
  if tail in suffix:
   suffix[tail].append(s[0])
  else:
   suffix[tail] = [s[0]]
for rec in suffix:
 if rec in prefix:
  i = 0
  while i < len(suffix[rec]):
   j = 0
   while j < len(prefix[rec]):
    if suffix[rec][i] != prefix[rec][j]:
     print suffix[rec][i] + ' ' + prefix[rec][j]
    j += 1
   i += 1

суббота, 19 ноября 2016 г.

Consensus and Profile (ROSALIND CONS)

numpy helps!
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

import numpy as np
import re

f = open('11.txt', 'r')
strings = re.findall(r'(>Rosalind_[0-9]+)\n(([A-T]+\n)+)', f.read())
str_len = len(strings[0][1].replace('\n', ''))
profile = np.zeros((4, str_len))
for s in strings:
 counter = 0
 str_data = np.zeros((4, str_len))
 st = s[1].replace('\n', '')
 for i in st:
  if i == 'A':
   str_data[0,counter] = 1
  elif i == 'C':
   str_data[1,counter] = 1
  elif i == 'G':
   str_data[2,counter] = 1
  elif i == 'T':
   str_data[3,counter] = 1
  counter += 1
 profile = profile + str_data
consensus = ''
position = 0
while position < str_len:
 column = profile[:, position]
 column_max = column.max()
 nucleotides = ['A', 'C', 'G', 'T']
 i = 0
 while i < 4:
  if column[i] == column_max:
   consensus = consensus + nucleotides[i]
   break
  i += 1
 position += 1
print consensus
A_line = ''
C_line = ''
G_line = ''
T_line = ''
j = 0
while j < str_len:
 A_line = A_line + str(int(profile[0, j])) + ' '
 C_line = C_line + str(int(profile[1, j])) + ' '
 G_line = G_line + str(int(profile[2, j])) + ' '
 T_line = T_line + str(int(profile[3, j])) + ' '
 j += 1
print 'A: ' + A_line
print 'C: ' + C_line
print 'G: ' + G_line
print 'T: ' + T_line

пятница, 18 ноября 2016 г.

Translating RNA into Protein (ROSALIND PROT)

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).
Return: The protein string encoded by s.

import sys

def codon_table():
 nukaa = {}
 nukaa['UUU'] = 'F'
 nukaa['CUU'] = 'L'
 nukaa['AUU'] = 'I'
 nukaa['GUU'] = 'V'
 nukaa['UUC'] = 'F'
 nukaa['CUC'] = 'L'
 nukaa['AUC'] = 'I'
 nukaa['GUC'] = 'V'
 nukaa['UUA'] = 'L'
 nukaa['CUA'] = 'L'
 nukaa['AUA'] = 'I'
 nukaa['GUA'] = 'V'
 nukaa['UUG'] = 'L'
 nukaa['CUG'] = 'L'
 nukaa['AUG'] = 'M'
 nukaa['GUG'] = 'V'
 nukaa['UCU'] = 'S'
 nukaa['CCU'] = 'P'
 nukaa['ACU'] = 'T'
 nukaa['GCU'] = 'A'
 nukaa['UCC'] = 'S'
 nukaa['CCC'] = 'P'
 nukaa['ACC'] = 'T'
 nukaa['GCC'] = 'A'
 nukaa['UCA'] = 'S'
 nukaa['CCA'] = 'P'
 nukaa['ACA'] = 'T'
 nukaa['GCA'] = 'A'
 nukaa['UCG'] = 'S'
 nukaa['CCG'] = 'P'
 nukaa['ACG'] = 'T'
 nukaa['GCG'] = 'A'
 nukaa['UAU'] = 'Y'
 nukaa['CAU'] = 'H'
 nukaa['AAU'] = 'N'
 nukaa['GAU'] = 'D'
 nukaa['UAC'] = 'Y'
 nukaa['CAC'] = 'H'
 nukaa['AAC'] = 'N'
 nukaa['GAC'] = 'D'
 nukaa['CAA'] = 'Q'
 nukaa['AAA'] = 'K'
 nukaa['GAA'] = 'E'
 nukaa['CAG'] = 'Q'
 nukaa['AAG'] = 'K'
 nukaa['GAG'] = 'E'
 nukaa['UGU'] = 'C'
 nukaa['CGU'] = 'R'
 nukaa['AGU'] = 'S'
 nukaa['GGU'] = 'G'
 nukaa['UGC'] = 'C'
 nukaa['CGC'] = 'R'
 nukaa['AGC'] = 'S'
 nukaa['GGC'] = 'G'
 nukaa['CGA'] = 'R'
 nukaa['AGA'] = 'R'
 nukaa['GGA'] = 'G'
 nukaa['UGG'] = 'W'
 nukaa['CGG'] = 'R'
 nukaa['AGG'] = 'R'
 nukaa['GGG'] = 'G'
 return nukaa


def main():
 if len(sys.argv) > 1:
  rna_string = sys.argv[1]
  nukaa = codon_table()
  nucl_str_len = len(rna_string) // 3
  nucl_string = ''
  i = 0
  while i < nucl_str_len:
   codon = rna_string[3*i:3*(i+1)]
   if codon=='UAA' or codon=='UAG' or codon=='UGA':
    break
   else:
     nucl_string += nukaa[codon]
   i+=1
  print nucl_string
 else:
  print 'Enter RNA sequence.'

if __name__ == '__main__':
 main()

Finding a Motif in DNA (ROSALIND SUBS)

Given: Two DNA strings s and t (each of length at most 1 kbp).
Return: All locations of t as a substring of s.

s = 'TCACTCGATTGGAACCGA'
t = 'CACTCGACA'
occurence = []
start_position = 0
while True:
 start_position = s.find(t, start_position+1)
 if start_position == -1:
  break
 occurence.append(start_position+1)
res = ''
for i in occurence:
 res += str(i) + ' '
print res


среда, 16 ноября 2016 г.

Mendel's First Love (ROSALIND IPRB)

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.
Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

import sys

def main():
 if len(sys.argv) > 1:
  k = int(sys.argv[1])
  m = int(sys.argv[2])
  n = int(sys.argv[3])
  T = k + m + n
  Z = T*(T - 1)
  print round(1-(n*(n-1) + 0.25*m*(m-1) + m*n)/Z, 5)
 else:
  print 'Enter k, m, n.'

if __name__ == '__main__':
 main()

Counting Point Mutations (ROSALIND HAMM)

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).
Return: The Hamming distance dH(s,t).

import sys
def main():
 if len(sys.argv) > 1:
  s1 = sys.argv[1]
  s2 = sys.argv[2]
  diff_counter = 0
  i = 0
  while i < len(s1):
   if s1[i] != s2[i]:
    diff_counter += 1
   i += 1
  print diff_counter
 else:
  print 'Enter 2 strings.'

if __name__ == '__main__':
 main()

Computing GC Content (ROSALIND GC)

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

It was much easier to solve this with regular expressions, no cyrcles.
It doesn't work without from __future__ import division because I have python2.7.

from __future__ import division
import re

def cg_content(s):
 CG = len(re.findall(r'C|G', s))
 AT = len(re.findall(r'A|T', s))
 ACGT = CG + AT
 return round(100*CG/ACGT, 6)

f = open('6.txt', 'r')
strings = re.findall(r'(Rosalind_.[0-9]+)\n(([A-Z]+\n)+)', f.read())
max_cg_content = {}
max_cg_content['id_string'] = ''
max_cg_content['string'] = ''
max_cg_content['cg_content'] = 0
for s in strings:
 string_cg_content = cg_content(s[1])
 if string_cg_content > max_cg_content['cg_content']:
  max_cg_content['id_string'] = s[0]
  max_cg_content['string'] = s[1]
  max_cg_content['cg_content'] = string_cg_content
print max_cg_content['id_string']
print max_cg_content['cg_content']

воскресенье, 13 ноября 2016 г.

ROSALIND FIBD

Given: Positive integers n≤100 and m≤20.
Return: The total number of pairs of rabbits that will remain after the n
-th month if all rabbits live for m months.

import sys

def F(n, m):
 if n < 2:
  return 1
 else:
  Old = [0, 0, 1]
  New = [0, 1, 0]
  month = 3
  while month <= n:
   born_to_die = month - m - 1
   if born_to_die < 0:
    Died = 0
   else:
    Died = New[born_to_die]
   Old_add = Old[month - 1] + New[month - 1] - Died
   New_add = Old[month - 1] - Died
   Old.append(Old_add)
   New.append(New_add)
   month+=1
  return Old[n] + New[n] - New[n - m]

def main():
 if len(sys.argv) > 1:
  print F(int(sys.argv[1]), int(sys.argv[2]))
 else:
  print 'Enter n and m.'

if __name__ == '__main__':
 main()

суббота, 12 ноября 2016 г.

ROSALIND FIB

Given: Positive integers n≤40 and k≤5.
Return: The total number of rabbit pairs that will be present after n
months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

import sys

def F(n, k):
 Fn_2 = 1
 Fn_1 = k+1
 if n > 1:
  i = 3
  while i < n:
   Fn = Fn_1 + k*Fn_2
   Fn_2 = Fn_1
   Fn_1 = Fn
   i+=1
  return Fn
 elif n == 1:
  return 1
 elif n == 2:
  return k+1

def main():
 if len(sys.argv) > 1:
  print F(int(sys.argv[1]), int(sys.argv[2]))
 else:
  print 'Enter k and n.'

if __name__ == '__main__':
 main()

четверг, 10 ноября 2016 г.

ROSALIND REVC

Given: A DNA string s of length at most 1000 bp.
Return: The reverse complement of s.

import sys
import re

def rev_comp(s):
 res = ''
 for i in s:
  if i == 'A':
   res += 'T'
  elif i == 'G':
   res += 'C'
  elif i == 'T':
   res += 'A'
  elif i == 'C':
   res += 'G'
 return res[::-1]

def main():
 if len(sys.argv) > 1:
  print rev_comp(sys.argv[1])
 else:
  print 'Enter your sequence!'

if __name__ == '__main__':
 main()

ROSALIND RNA

Given: A DNA string t having length at most 1000 nt.
Return: The transcribed RNA string of t.

import sys
import re

def main():
 if len(sys.argv) > 1:
  print re.sub(r'T', r'U', sys.argv[1])
 else:
  print 'Enter your sequence!'

if __name__ == '__main__':
 main()

ROSALIND DNA

Given: A DNA string s of length at most 1000 nt.
Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

I've wrote even two functions to count lettes:

import sys
import re

def cycle(s):
 a_num = 0
 t_num = 0
 g_num = 0
 c_num = 0
 for i in s:
  if i == 'A':
   a_num += 1
  elif i == 'T':
   t_num += 1
  elif i == 'G':
   g_num += 1
  elif i == 'C':
   c_num += 1
 return str(a_num) + ' ' + str(c_num) + ' ' + str(g_num) + ' ' + str(t_num)

def reg_expr(s):
 A = re.findall(r'A', s)
 C = re.findall(r'C', s)
 G = re.findall(r'G', s)
 T = re.findall(r'T', s)
 return str(len(A)) + ' ' + str(len(C)) + ' ' + str(len(G)) + ' ' + str(len(T))

def main():
 if len(sys.argv) > 1:
  print reg_expr(sys.argv[1])   #cycle(sys.argv[1])
 else:
  print 'Enter your sequence!'

if __name__ == '__main__':
 main()

Развлекательныне сайты по биоинформатике

Сайты для биоинформатика

Информация по геномам NCBI

Статьи по биомедицине PubMed

Шкалы Phred 33 Phred 64