Естественные науки и программирование: 2016

среда, 28 декабря 2016 г.

Finding a Shared Motif (ROSALIND LCSM)

Given: A collection of

k

(

k \leq 100

) DNA strings of length at most 1 kbp each in FASTA format.

k \leq 100

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

import re

def main():
print MainFunc()

def MainFunc():
f = open('17.txt', 'r')
string = ''
strings = []
for line in f:
if line[0] == '>':
   if string != '':
    string = string.replace('\n', '')
    strings.append(string)
    string = ''
else:
   string = string + line
if string != '':
    string = string.replace('\n', '')
    strings.append(string)
global s1, s2
s1 = strings.pop()
s2 = strings.pop()
m = []
m.append(['A', 'C', 'G', 'T'])
pre_motifs = pre_motif(m)
len_pre_motifs = len(pre_motifs)
i = len_pre_motifs - 1
suit = 1
while i >= 0:
for motif in pre_motifs[i]:
   for s in strings:
    if not re.search(motif, s):
     suit = 0
     break
   if suit == 1:
    return motif
i -= 1
return 'Found nothing.'

def pre_motif(m):
letters = ['A', 'C', 'G', 'T']
lenm = len(m)
new_m_arr = []
for head in m[lenm - 1]:
for letter in letters:
   new_m = head + letter
   match_s1 = re.search(new_m, s1)
   if match_s1:
    match_s2 = re.search(new_m, s2)
    if match_s2:
     new_m_arr.append(new_m)
if len(new_m_arr) > 0:
m.append(new_m_arr)
return pre_motif(m)
else:
return m

if __name__ == "__main__":
main()

понедельник, 19 декабря 2016 г.

Интересная задача на сочетания (решетка)

Пирамидка, в которую играет ребенок, состоит из

49

дисков, по

7

каждого размера (диски одного размера неразличимы). Пирамидка устроена таким образом, что на нижнем слое может находиться только самый большой диск, а на верхнем слое только самый маленький. Мы будем называть правильной сборкой пирамидки такую последовательность из семи дисков, что каждый следующий не больше предыдущего, первый диск — самого большого размера, а последний — самого маленького (например,

7654321,

7775331,

и 7

7222211

— правильные сборки, а

7654322

-

нет). Сколько всего существует правильных сборок? Постарайтесь проинтерпретировать данную задачу в терминах путей на решетке.

суббота, 17 декабря 2016 г.

Задачка на множества

Введите первые девять цифр числа перестановок всех букв английского алфавита, которые не содержат в себе подстрок

f i s h,

r a t

Используйте принцип включений-исключений и рассуждения, похожие на те, которые мы проводили, когда получали общее количество перестановок.

среда, 14 декабря 2016 г.

Genome Assembly as Shortest Superstring (ROSALIND LONG)

Enother interesting task!
What I do: suppose there is such superstring S. Let the first read coordinate on S be 0. In MainFunc I find all right tails for first read and calculate their coordinates on S. Using recursion, I find everething that goes after first read to the right.
Then I reverse all that rests and do the same. Then I reverse it back and get the "left tail".

Given: At most 50 DNA strings whose length does not exceed 1 kbp in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).

The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.

Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).

def find_middle(len1):
if len1 % 2 == 0:
return len1/2 + 1
else:
return (len1 + 1)/2

def find_max_middle(len1, len2):
mlen1 = find_middle(len1)
mlen2 = find_middle(len2)
if mlen1 > mlen2:
return mlen1
else:
return mlen2

def MainFunc(find_what_reads, find_where_reads, find_what_positions, res_string):
find_what_reads1 = []
find_what_positions1 = []
find_what_counter = 0
for find_what in find_what_reads:
find_where_counter = 0
for find_where in find_where_reads:
   find_what_len = len(find_what)
   find_where_len = len(find_where)
   max_middle = find_max_middle(find_what_len, find_where_len)
   a = find_what_len - max_middle
   find_what_part = find_what[a:]
   coincidence = find_where.find(find_what_part, 0)
   if coincidence > 0:
    find_where_res_string_position = find_what_positions[find_what_counter] + a - coincidence
    d = find_where_len - len(res_string[find_where_res_string_position:])
    if d > 0:
     res_string = res_string + find_where[-d:]
    find_where_reads[find_where_counter] = ''
    find_what_reads1.append(find_where)
    find_what_positions1.append(find_where_res_string_position)
   find_where_counter += 1
find_what_counter += 1
if len(find_what_reads1) > 0:
res_string = MainFunc(find_what_reads1, find_where_reads, find_what_positions1, res_string)['res_string']
find_where_reads = MainFunc(find_what_reads1, find_where_reads, find_what_positions1, res_string)['find_where_reads']
res = {}
res['res_string'] = res_string
res['find_where_reads'] = find_where_reads
return res

def main():
reads = []
f = open('16.txt', 'r')
read = ''
for line in f:
if line[0] != '>':
   read = read + line
else:
   if read != '':
    read_clean = read.replace('\n', '')
    reads.append(read_clean)
   read = ''
read_clean = read.replace('\n', '')
reads.append(read_clean)
res_string = reads[0]
n = len(res_string)
res_string_right2left = reads[0]
find_what_positions = [0]
find_what_reads = [reads[0]]
res_left2right = MainFunc(find_what_reads, reads, find_what_positions, res_string)
string_left2right = res_left2right['res_string']
find_where_reads_left2right = res_left2right['find_where_reads']
find_where_reads_right2left = []
for r in find_where_reads_left2right:
find_where_reads_right2left.append(r[::-1])
res_right2left = MainFunc([res_string_right2left[::-1]], find_where_reads_right2left, find_what_positions, res_string_right2left[::-1])
string_right2left = res_right2left['res_string']
print string_right2left[n:][::-1] + string_left2right

if __name__ == '__main__':
main()

суббота, 10 декабря 2016 г.

Speeding Up Motif Finding (ROSALIND KMP)

Given: A DNA string s (of length at most 100 kbp) in FASTA format.
Return: The failure array of s.

f = open('15.txt','r')
pre_s = ''
for line in f:
pre_s = pre_s + line
f.close()
s = pre_s.replace('\n', '')
len_s = len(s)
first = s[0]
i = 1
P = [0]
matches = [] # list with matches lengths
while i < len_s:
max_match = 0 # max match length for current string symbol
j = 0
len_matches = len(matches)
while j < len_matches:
match_num = matches[j]
if s[i] == s[match_num]:
   match_num_pl1 = match_num + 1
   matches[j] = match_num_pl1
   if match_num_pl1 > max_match:
    max_match = match_num_pl1
else:
   matches.pop(j)
   j -= 1
   len_matches -= 1
j += 1
if s[i] == first:
matches.append(1)
if max_match < 1:
   max_match = 1
i += 1
P.append(max_match)
res = ''
for k in P:
res = res + ' ' + str(k)
f1 = open('15_1.txt', 'w')
f1.write(res)
f1.close()

Independent Alleles (ROSALIND LIA)

The most interesting problem by now!

Solution:

Code:

from __future__ import division
import sys
import math
from scipy import special

def main():
if len(sys.argv) > 2:
k = int(sys.argv[1])
N = int(sys.argv[2])
descend_num = pow(2, k)
i = N
prob = 0
while i <= descend_num:
   p = descend_num - i
   prob = prob + special.binom(descend_num, i)*pow(1/4, i)*pow(3/4, p)
   i += 1
print round(prob, 3)
else:
print 'Enter data!'

if __name__ == '__main__':
main()

среда, 30 ноября 2016 г.

Introduction to Random Strings (ROSALIND PROB)

If I want to find the propobility that string s and some random string are equal, it'll 0.25 to the power of s length. Here I have not a random string, and it's not 0.25, but some other number, which is easy to find from gc-content.
Hint in this problem is very useful.

Given: A DNA string

s

of length at most 100 bp and an array

A

containing at most 20 numbers between 0 and 1.
Return: An array

B

having the same length as

A

in which

B [k]

represents the common logarithm of the probability that a random string constructed with the GC-content found in

A [k]

will match

s

exactly.
Hint: One property of the logarithm function is that for any positive numbers

x

and

y

\log_{10} (x \cdot y) = \log_{10} (x) + \log_{10} (y)

.

from __future__ import division
import re
import sys
import math

def atcg_prob(x):
cg_prob = float(x)
at_prob = 1 - cg_prob
atcg_prob = {}
atcg_prob['A'] = at_prob / 2
atcg_prob['T'] = atcg_prob['A']
atcg_prob['C'] = cg_prob / 2
atcg_prob['G'] = atcg_prob['C']
return atcg_prob

def main():
if len(sys.argv) > 1:
res = ''
i = 2
while i < len(sys.argv):
   cont = atcg_prob(sys.argv[i])
   prob = 0
   for nuk in sys.argv[1]:
    prob = prob + math.log(cont[nuk], 10)
   #res = res + str(math.log(prob, 10)) + ' '
   res = res + str(round(prob, 3)) + ' '
   i += 1
print res
else:
print 'Enter datas!'

if __name__ == '__main__':
main()

вторник, 29 ноября 2016 г.

Хожу на химфак, слушаю лекции.

Хожу на химический факультет, слушаю органику. Многое понимаю, но сделать ничего не могу. Хотя было бы странно, если бы что-то могла, позанимавшись химией всего 60 часов. На занятиях очень интересно, прям как в кино. Тут пример того, что рассказывают.
Как получить фенолфталеин из фенола и фталевого ангидрида.
Вообще не очень понятно, почему атакуется только верхний кислород, а нижний нет. Но такая уж она химия - волшебная наука.

А тут показано почему фенолфталеин меняет окраску в щелочной среде. Как видно, от щелочи появляется много двойных связей по всей молекуле. Они то и меняет оптические свойства, так что щелочной среде он становится малиновым. Кстати, в кислой он розовый.

вторник, 22 ноября 2016 г.

Overlap Graphs (ROSALIND GRPH)

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O3. You may return edges in any order.

import re

f = open('12.txt', 'r')
reads = re.findall(r'(Rosalind_[0-9]+)\n(([A-T]+\n)+)', f.read())
suffix = {}
prefix = {}
for s in reads:
string = s[1].replace('\n', '')
if len(string) > 3:
head = string[:3]
if head in prefix:
   prefix[head].append(s[0])
else:
   prefix[head] = [s[0]]
tail = string[-3:]
if tail in suffix:
   suffix[tail].append(s[0])
else:
   suffix[tail] = [s[0]]
for rec in suffix:
if rec in prefix:
i = 0
while i < len(suffix[rec]):
   j = 0
   while j < len(prefix[rec]):
    if suffix[rec][i] != prefix[rec][j]:
     print suffix[rec][i] + ' ' + prefix[rec][j]
    j += 1
   i += 1

суббота, 19 ноября 2016 г.

Consensus and Profile (ROSALIND CONS)

numpy helps!
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

import numpy as np
import re

f = open('11.txt', 'r')
strings = re.findall(r'(>Rosalind_[0-9]+)\n(([A-T]+\n)+)', f.read())
str_len = len(strings[0][1].replace('\n', ''))
profile = np.zeros((4, str_len))
for s in strings:
counter = 0
str_data = np.zeros((4, str_len))
st = s[1].replace('\n', '')
for i in st:
if i == 'A':
   str_data[0,counter] = 1
elif i == 'C':
   str_data[1,counter] = 1
elif i == 'G':
   str_data[2,counter] = 1
elif i == 'T':
   str_data[3,counter] = 1
counter += 1
profile = profile + str_data
consensus = ''
position = 0
while position < str_len:
column = profile[:, position]
column_max = column.max()
nucleotides = ['A', 'C', 'G', 'T']
i = 0
while i < 4:
if column[i] == column_max:
   consensus = consensus + nucleotides[i]
   break
i += 1
position += 1
print consensus
A_line = ''
C_line = ''
G_line = ''
T_line = ''
j = 0
while j < str_len:
A_line = A_line + str(int(profile[0, j])) + ' '
C_line = C_line + str(int(profile[1, j])) + ' '
G_line = G_line + str(int(profile[2, j])) + ' '
T_line = T_line + str(int(profile[3, j])) + ' '
j += 1
print 'A: ' + A_line
print 'C: ' + C_line
print 'G: ' + G_line
print 'T: ' + T_line

пятница, 18 ноября 2016 г.

Translating RNA into Protein (ROSALIND PROT)

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).
Return: The protein string encoded by s.


import sys



def codon_table():

 nukaa = {}

 nukaa['UUU'] = 'F'

 nukaa['CUU'] = 'L'

 nukaa['AUU'] = 'I'

 nukaa['GUU'] = 'V'

 nukaa['UUC'] = 'F'

 nukaa['CUC'] = 'L'

 nukaa['AUC'] = 'I'

 nukaa['GUC'] = 'V'

 nukaa['UUA'] = 'L'

 nukaa['CUA'] = 'L'

 nukaa['AUA'] = 'I'

 nukaa['GUA'] = 'V'

 nukaa['UUG'] = 'L'

 nukaa['CUG'] = 'L'

 nukaa['AUG'] = 'M'

 nukaa['GUG'] = 'V'

 nukaa['UCU'] = 'S'

 nukaa['CCU'] = 'P'

 nukaa['ACU'] = 'T'

 nukaa['GCU'] = 'A'

 nukaa['UCC'] = 'S'

 nukaa['CCC'] = 'P'

 nukaa['ACC'] = 'T'

 nukaa['GCC'] = 'A'

 nukaa['UCA'] = 'S'

 nukaa['CCA'] = 'P'

 nukaa['ACA'] = 'T'

 nukaa['GCA'] = 'A'

 nukaa['UCG'] = 'S'

 nukaa['CCG'] = 'P'

 nukaa['ACG'] = 'T'

 nukaa['GCG'] = 'A'

 nukaa['UAU'] = 'Y'

 nukaa['CAU'] = 'H'

 nukaa['AAU'] = 'N'

 nukaa['GAU'] = 'D'

 nukaa['UAC'] = 'Y'

 nukaa['CAC'] = 'H'

 nukaa['AAC'] = 'N'

 nukaa['GAC'] = 'D'

 nukaa['CAA'] = 'Q'

 nukaa['AAA'] = 'K'

 nukaa['GAA'] = 'E'

 nukaa['CAG'] = 'Q'

 nukaa['AAG'] = 'K'

 nukaa['GAG'] = 'E'

 nukaa['UGU'] = 'C'

 nukaa['CGU'] = 'R'

 nukaa['AGU'] = 'S'

 nukaa['GGU'] = 'G'

 nukaa['UGC'] = 'C'

 nukaa['CGC'] = 'R'

 nukaa['AGC'] = 'S'

 nukaa['GGC'] = 'G'

 nukaa['CGA'] = 'R'

 nukaa['AGA'] = 'R'

 nukaa['GGA'] = 'G'

 nukaa['UGG'] = 'W'

 nukaa['CGG'] = 'R'

 nukaa['AGG'] = 'R'

 nukaa['GGG'] = 'G'

 return nukaa





def main():

 if len(sys.argv) > 1:

  rna_string = sys.argv[1]

  nukaa = codon_table()

  nucl_str_len = len(rna_string) // 3

  nucl_string = ''

  i = 0

  while i < nucl_str_len:

   codon = rna_string[3*i:3*(i+1)]

   if codon=='UAA' or codon=='UAG' or codon=='UGA':

    break

   else:

     nucl_string += nukaa[codon]

   i+=1

  print nucl_string

 else:

  print 'Enter RNA sequence.'



if __name__ == '__main__':

 main()

Finding a Motif in DNA (ROSALIND SUBS)



Given: Two DNA strings s and t (each of length at most 1 kbp).

Return: All locations of t as a substring of s. 



s = 'TCACTCGATTGGAACCGA'

t = 'CACTCGACA'

occurence = []

start_position = 0

while True:

 start_position = s.find(t, start_position+1)

 if start_position == -1:

  break

 occurence.append(start_position+1)

res = ''

for i in occurence:

 res += str(i) + ' '

print res

среда, 16 ноября 2016 г.

Mendel's First Love (ROSALIND IPRB)

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.
Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

import sys

def main():
if len(sys.argv) > 1:
k = int(sys.argv[1])
m = int(sys.argv[2])
n = int(sys.argv[3])
T = k + m + n
Z = T*(T - 1)
print round(1-(n*(n-1) + 0.25*m*(m-1) + m*n)/Z, 5)
else:
print 'Enter k, m, n.'

if __name__ == '__main__':
main()

Counting Point Mutations (ROSALIND HAMM)

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).
Return: The Hamming distance dH(s,t).

import sys
def main():
if len(sys.argv) > 1:
s1 = sys.argv[1]
s2 = sys.argv[2]
diff_counter = 0
i = 0
while i < len(s1):
   if s1[i] != s2[i]:
    diff_counter += 1
   i += 1
print diff_counter
else:
print 'Enter 2 strings.'

if __name__ == '__main__':
main()

Computing GC Content (ROSALIND GC)

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

It was much easier to solve this with regular expressions, no cyrcles.
It doesn't work without from __future__ import division because I have python2.7.

from __future__ import division
import re

def cg_content(s):
CG = len(re.findall(r'C|G', s))
AT = len(re.findall(r'A|T', s))
ACGT = CG + AT
return round(100*CG/ACGT, 6)

f = open('6.txt', 'r')
strings = re.findall(r'(Rosalind_.[0-9]+)\n(([A-Z]+\n)+)', f.read())
max_cg_content = {}
max_cg_content['id_string'] = ''
max_cg_content['string'] = ''
max_cg_content['cg_content'] = 0
for s in strings:
string_cg_content = cg_content(s[1])
if string_cg_content > max_cg_content['cg_content']:
max_cg_content['id_string'] = s[0]
max_cg_content['string'] = s[1]
max_cg_content['cg_content'] = string_cg_content
print max_cg_content['id_string']
print max_cg_content['cg_content']

воскресенье, 13 ноября 2016 г.

ROSALIND FIBD

Given: Positive integers n≤100 and m≤20.
Return: The total number of pairs of rabbits that will remain after the n
-th month if all rabbits live for m months.

import sys

def F(n, m):
if n < 2:
return 1
else:
Old = [0, 0, 1]
New = [0, 1, 0]
month = 3
while month <= n:
   born_to_die = month - m - 1
   if born_to_die < 0:
    Died = 0
   else:
    Died = New[born_to_die]
   Old_add = Old[month - 1] + New[month - 1] - Died
   New_add = Old[month - 1] - Died
   Old.append(Old_add)
   New.append(New_add)
   month+=1
return Old[n] + New[n] - New[n - m]

def main():
if len(sys.argv) > 1:
print F(int(sys.argv[1]), int(sys.argv[2]))
else:
print 'Enter n and m.'

if __name__ == '__main__':
main()

суббота, 12 ноября 2016 г.

ROSALIND FIB

Given: Positive integers n≤40 and k≤5.
Return: The total number of rabbit pairs that will be present after n
months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

import sys

def F(n, k):
Fn_2 = 1
Fn_1 = k+1
if n > 1:
i = 3
while i < n:
   Fn = Fn_1 + k*Fn_2
   Fn_2 = Fn_1
   Fn_1 = Fn
   i+=1
return Fn
elif n == 1:
return 1
elif n == 2:
return k+1

def main():
if len(sys.argv) > 1:
print F(int(sys.argv[1]), int(sys.argv[2]))
else:
print 'Enter k and n.'

if __name__ == '__main__':
main()

четверг, 10 ноября 2016 г.

ROSALIND REVC

Given: A DNA string s of length at most 1000 bp.
Return: The reverse complement of s.

import sys
import re

def rev_comp(s):
res = ''
for i in s:
if i == 'A':
   res += 'T'
elif i == 'G':
   res += 'C'
elif i == 'T':
   res += 'A'
elif i == 'C':
   res += 'G'
return res[::-1]

def main():
if len(sys.argv) > 1:
print rev_comp(sys.argv[1])
else:
print 'Enter your sequence!'

if __name__ == '__main__':
main()

ROSALIND RNA

Given: A DNA string t having length at most 1000 nt.
Return: The transcribed RNA string of t.

import sys
import re

def main():
if len(sys.argv) > 1:
print re.sub(r'T', r'U', sys.argv[1])
else:
print 'Enter your sequence!'

if __name__ == '__main__':
main()

ROSALIND DNA

Given: A DNA string s of length at most 1000 nt.
Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

I've wrote even two functions to count lettes:

import sys
import re

def cycle(s):
a_num = 0
t_num = 0
g_num = 0
c_num = 0
for i in s:
if i == 'A':
   a_num += 1
elif i == 'T':
   t_num += 1
elif i == 'G':
   g_num += 1
elif i == 'C':
   c_num += 1
return str(a_num) + ' ' + str(c_num) + ' ' + str(g_num) + ' ' + str(t_num)

def reg_expr(s):
A = re.findall(r'A', s)
C = re.findall(r'C', s)
G = re.findall(r'G', s)
T = re.findall(r'T', s)
return str(len(A)) + ' ' + str(len(C)) + ' ' + str(len(G)) + ' ' + str(len(T))

def main():
if len(sys.argv) > 1:
print reg_expr(sys.argv[1])   #cycle(sys.argv[1])
else:
print 'Enter your sequence!'

if __name__ == '__main__':
main()

Развлекательныне сайты по биоинформатике

Заметки о генетике, с историческим уклоном.

Где учить питон понравилось у гугла.

Про горизонтальный перенос генов.

Митотический кроссинговер.

Сайты для биоинформатика

Информация по геномам NCBI

Статьи по биомедицине PubMed

Шкалы Phred 33 Phred 64

среда, 28 декабря 2016 г.

понедельник, 19 декабря 2016 г.

суббота, 17 декабря 2016 г.

среда, 14 декабря 2016 г.

суббота, 10 декабря 2016 г.

среда, 30 ноября 2016 г.

вторник, 29 ноября 2016 г.

вторник, 22 ноября 2016 г.

суббота, 19 ноября 2016 г.

пятница, 18 ноября 2016 г.

среда, 16 ноября 2016 г.

воскресенье, 13 ноября 2016 г.

суббота, 12 ноября 2016 г.

четверг, 10 ноября 2016 г.