Естественные науки и программирование: декабря 2016

среда, 28 декабря 2016 г.

Finding a Shared Motif (ROSALIND LCSM)

Given: A collection of

k

(

k \leq 100

) DNA strings of length at most 1 kbp each in FASTA format.

k \leq 100

Return: A longest common substring of the collection. (If multiple solutions exist, you may return any single solution.)

import re

def main():
print MainFunc()

def MainFunc():
f = open('17.txt', 'r')
string = ''
strings = []
for line in f:
if line[0] == '>':
   if string != '':
    string = string.replace('\n', '')
    strings.append(string)
    string = ''
else:
   string = string + line
if string != '':
    string = string.replace('\n', '')
    strings.append(string)
global s1, s2
s1 = strings.pop()
s2 = strings.pop()
m = []
m.append(['A', 'C', 'G', 'T'])
pre_motifs = pre_motif(m)
len_pre_motifs = len(pre_motifs)
i = len_pre_motifs - 1
suit = 1
while i >= 0:
for motif in pre_motifs[i]:
   for s in strings:
    if not re.search(motif, s):
     suit = 0
     break
   if suit == 1:
    return motif
i -= 1
return 'Found nothing.'

def pre_motif(m):
letters = ['A', 'C', 'G', 'T']
lenm = len(m)
new_m_arr = []
for head in m[lenm - 1]:
for letter in letters:
   new_m = head + letter
   match_s1 = re.search(new_m, s1)
   if match_s1:
    match_s2 = re.search(new_m, s2)
    if match_s2:
     new_m_arr.append(new_m)
if len(new_m_arr) > 0:
m.append(new_m_arr)
return pre_motif(m)
else:
return m

if __name__ == "__main__":
main()

понедельник, 19 декабря 2016 г.

Интересная задача на сочетания (решетка)

Пирамидка, в которую играет ребенок, состоит из

49

дисков, по

7

каждого размера (диски одного размера неразличимы). Пирамидка устроена таким образом, что на нижнем слое может находиться только самый большой диск, а на верхнем слое только самый маленький. Мы будем называть правильной сборкой пирамидки такую последовательность из семи дисков, что каждый следующий не больше предыдущего, первый диск — самого большого размера, а последний — самого маленького (например,

7654321,

7775331,

и 7

7222211

— правильные сборки, а

7654322

-

нет). Сколько всего существует правильных сборок? Постарайтесь проинтерпретировать данную задачу в терминах путей на решетке.

суббота, 17 декабря 2016 г.

Задачка на множества

Введите первые девять цифр числа перестановок всех букв английского алфавита, которые не содержат в себе подстрок

f i s h,

r a t

Используйте принцип включений-исключений и рассуждения, похожие на те, которые мы проводили, когда получали общее количество перестановок.

среда, 14 декабря 2016 г.

Genome Assembly as Shortest Superstring (ROSALIND LONG)

Enother interesting task!
What I do: suppose there is such superstring S. Let the first read coordinate on S be 0. In MainFunc I find all right tails for first read and calculate their coordinates on S. Using recursion, I find everething that goes after first read to the right.
Then I reverse all that rests and do the same. Then I reverse it back and get the "left tail".

Given: At most 50 DNA strings whose length does not exceed 1 kbp in FASTA format (which represent reads deriving from the same strand of a single linear chromosome).

The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.

Return: A shortest superstring containing all the given strings (thus corresponding to a reconstructed chromosome).

def find_middle(len1):
if len1 % 2 == 0:
return len1/2 + 1
else:
return (len1 + 1)/2

def find_max_middle(len1, len2):
mlen1 = find_middle(len1)
mlen2 = find_middle(len2)
if mlen1 > mlen2:
return mlen1
else:
return mlen2

def MainFunc(find_what_reads, find_where_reads, find_what_positions, res_string):
find_what_reads1 = []
find_what_positions1 = []
find_what_counter = 0
for find_what in find_what_reads:
find_where_counter = 0
for find_where in find_where_reads:
   find_what_len = len(find_what)
   find_where_len = len(find_where)
   max_middle = find_max_middle(find_what_len, find_where_len)
   a = find_what_len - max_middle
   find_what_part = find_what[a:]
   coincidence = find_where.find(find_what_part, 0)
   if coincidence > 0:
    find_where_res_string_position = find_what_positions[find_what_counter] + a - coincidence
    d = find_where_len - len(res_string[find_where_res_string_position:])
    if d > 0:
     res_string = res_string + find_where[-d:]
    find_where_reads[find_where_counter] = ''
    find_what_reads1.append(find_where)
    find_what_positions1.append(find_where_res_string_position)
   find_where_counter += 1
find_what_counter += 1
if len(find_what_reads1) > 0:
res_string = MainFunc(find_what_reads1, find_where_reads, find_what_positions1, res_string)['res_string']
find_where_reads = MainFunc(find_what_reads1, find_where_reads, find_what_positions1, res_string)['find_where_reads']
res = {}
res['res_string'] = res_string
res['find_where_reads'] = find_where_reads
return res

def main():
reads = []
f = open('16.txt', 'r')
read = ''
for line in f:
if line[0] != '>':
   read = read + line
else:
   if read != '':
    read_clean = read.replace('\n', '')
    reads.append(read_clean)
   read = ''
read_clean = read.replace('\n', '')
reads.append(read_clean)
res_string = reads[0]
n = len(res_string)
res_string_right2left = reads[0]
find_what_positions = [0]
find_what_reads = [reads[0]]
res_left2right = MainFunc(find_what_reads, reads, find_what_positions, res_string)
string_left2right = res_left2right['res_string']
find_where_reads_left2right = res_left2right['find_where_reads']
find_where_reads_right2left = []
for r in find_where_reads_left2right:
find_where_reads_right2left.append(r[::-1])
res_right2left = MainFunc([res_string_right2left[::-1]], find_where_reads_right2left, find_what_positions, res_string_right2left[::-1])
string_right2left = res_right2left['res_string']
print string_right2left[n:][::-1] + string_left2right

if __name__ == '__main__':
main()

суббота, 10 декабря 2016 г.

Speeding Up Motif Finding (ROSALIND KMP)

Given: A DNA string s (of length at most 100 kbp) in FASTA format.
Return: The failure array of s.

f = open('15.txt','r')
pre_s = ''
for line in f:
pre_s = pre_s + line
f.close()
s = pre_s.replace('\n', '')
len_s = len(s)
first = s[0]
i = 1
P = [0]
matches = [] # list with matches lengths
while i < len_s:
max_match = 0 # max match length for current string symbol
j = 0
len_matches = len(matches)
while j < len_matches:
match_num = matches[j]
if s[i] == s[match_num]:
   match_num_pl1 = match_num + 1
   matches[j] = match_num_pl1
   if match_num_pl1 > max_match:
    max_match = match_num_pl1
else:
   matches.pop(j)
   j -= 1
   len_matches -= 1
j += 1
if s[i] == first:
matches.append(1)
if max_match < 1:
   max_match = 1
i += 1
P.append(max_match)
res = ''
for k in P:
res = res + ' ' + str(k)
f1 = open('15_1.txt', 'w')
f1.write(res)
f1.close()

Independent Alleles (ROSALIND LIA)

The most interesting problem by now!

Solution:

Code:

from __future__ import division
import sys
import math
from scipy import special

def main():
if len(sys.argv) > 2:
k = int(sys.argv[1])
N = int(sys.argv[2])
descend_num = pow(2, k)
i = N
prob = 0
while i <= descend_num:
   p = descend_num - i
   prob = prob + special.binom(descend_num, i)*pow(1/4, i)*pow(3/4, p)
   i += 1
print round(prob, 3)
else:
print 'Enter data!'

if __name__ == '__main__':
main()

среда, 28 декабря 2016 г.