Natural Sciences and Programming: November 2016

Wednesday, November 30, 2016

Introduction to Random Strings (ROSALIND PROB)

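If every nucleotide were equally likely, the probability that a random string equals s would be 0.25 raised to the power of the length of s. Here the string is not uniformly random, so the per-symbol probability is not 0.25 but a number that is easy to find from the GC-content x: G and C each get x/2, and A and T each get (1 - x)/2.
The hint in this problem is very useful: summing logarithms avoids multiplying many small probabilities.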

Given: A DNA string s of length at most 100 bp and an array A containing at most 20 numbers between 0 and 1.
Return: An array B having the same length as A in which B[k] represents the common logarithm of the probability that a random string constructed with the GC-content found in A[k] will match s exactly.
Hint: One property of the logarithm function is that for any positive numbers x and y, \log_{10} (x \cdot y) = \log_{10} (x) + \log_{10} (y).

from __future__ import division
import sys
import math

def atcg_prob(x):
    # x is the GC-content: G and C share it equally, A and T share the rest
    cg_prob = float(x)
    at_prob = 1 - cg_prob
    probs = {}
    probs['A'] = at_prob / 2
    probs['T'] = probs['A']
    probs['C'] = cg_prob / 2
    probs['G'] = probs['C']
    return probs

def main():
    # usage: script.py <DNA string> <GC-content> [<GC-content> ...]
    if len(sys.argv) > 2:
        res = ''
        i = 2
        while i < len(sys.argv):
            cont = atcg_prob(sys.argv[i])
            prob = 0
            # sum of logarithms instead of the logarithm of a product (see the hint)
            for nuk in sys.argv[1]:
                prob = prob + math.log10(cont[nuk])
            res = res + str(round(prob, 3)) + ' '
            i += 1
        print res
    else:
        print 'Enter data!'

if __name__ == '__main__':
    main()
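
Since every symbol of s is either G/C or A/T, the per-symbol loop can probably be collapsed into two counts. Below is a minimal sketch of that shortcut for Python 3. The function name log10_match_probability is made up for this illustration, and it assumes 0 < gc < 1 so that no logarithm of zero is taken.

```python
import math

def log10_match_probability(s, gc):
    """Common log of the probability that a random string with GC-content gc equals s."""
    gc_count = sum(1 for ch in s if ch in 'GC')
    at_count = len(s) - gc_count  # assumes s contains only A, C, G, T
    # log10(x * y) = log10(x) + log10(y), so the product over symbols becomes a sum
    return gc_count * math.log10(gc / 2) + at_count * math.log10((1 - gc) / 2)

print(round(log10_match_probability('ACGATACAA', 0.129), 3))
```

Counting once means the cost per GC-content value no longer depends on the length of s.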

Tuesday, November 29, 2016

Going to the chemistry department, attending lectures.

I go to the chemistry faculty and sit in on organic chemistry. I understand a lot, but I can't do anything myself yet. Then again, it would be strange if I could after only 60 hours of chemistry. The classes are very interesting, just like a movie. Here is an example of what they cover.
How to obtain phenolphthalein from phenol and phthalic anhydride.
It isn't really clear why only the upper oxygen is attacked and not the lower one. But that's chemistry for you: a magical science.

And here is why phenolphthalein changes color in a basic medium. As you can see, the base produces many double bonds across the whole molecule. These are what change the optical properties, so in a basic medium it turns crimson. By the way, in an acidic medium it is colorless.

Tuesday, November 22, 2016

Overlap Graphs (ROSALIND GRPH)

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.
Return: The adjacency list corresponding to O3. You may return edges in any order.

import re

with open('12.txt', 'r') as f:
    data = f.read() + '\n'  # make sure the last sequence line ends with a newline
# each match: (ID, sequence lines, last sequence line)
reads = re.findall(r'(Rosalind_[0-9]+)\n(([ACGT]+\n)+)', data)
suffix = {}
prefix = {}
for s in reads:
    string = s[1].replace('\n', '')
    if len(string) >= 3:
        # group IDs by their first three letters...
        head = string[:3]
        if head in prefix:
            prefix[head].append(s[0])
        else:
            prefix[head] = [s[0]]
        # ...and by their last three letters
        tail = string[-3:]
        if tail in suffix:
            suffix[tail].append(s[0])
        else:
            suffix[tail] = [s[0]]
# an edge goes from every string ending in rec to every other string starting with rec
for rec in suffix:
    if rec in prefix:
        for tail_id in suffix[rec]:
            for head_id in prefix[rec]:
                if tail_id != head_id:
                    print tail_id + ' ' + head_id
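
The same idea, separated from file reading so it can be checked on a small example. This is only a sketch for Python 3: the function name overlap_edges and the records dictionary (ID to sequence) are assumptions made for the illustration.

```python
from collections import defaultdict

def overlap_edges(records, k=3):
    """Pairs (tail_id, head_id) where the last k letters of one string
    equal the first k letters of a different string."""
    by_prefix = defaultdict(list)
    for name, seq in records.items():
        if len(seq) >= k:
            by_prefix[seq[:k]].append(name)
    edges = []
    for name, seq in records.items():
        if len(seq) < k:
            continue
        for other in by_prefix.get(seq[-k:], []):
            if other != name:  # no edge from a string to itself
                edges.append((name, other))
    return edges

sample = {
    'Rosalind_0498': 'AAATAAA',
    'Rosalind_2391': 'AAATTTT',
    'Rosalind_2323': 'TTTTCCC',
    'Rosalind_0442': 'AAATCCC',
    'Rosalind_5013': 'GGGTGGG',
}
for tail_id, head_id in overlap_edges(sample):
    print(tail_id + ' ' + head_id)
```

Indexing the prefixes first avoids comparing every pair of strings directly.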

Saturday, November 19, 2016

Consensus and Profile (ROSALIND CONS)

numpy helps!
Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.
Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

import numpy as np
import re

with open('11.txt', 'r') as f:
    data = f.read() + '\n'  # make sure the last sequence line ends with a newline
strings = re.findall(r'(>Rosalind_[0-9]+)\n(([ACGT]+\n)+)', data)
str_len = len(strings[0][1].replace('\n', ''))
nucleotides = ['A', 'C', 'G', 'T']
# profile[row, column]: how many strings have nucleotides[row] at that position
profile = np.zeros((4, str_len))
for s in strings:
    counter = 0
    str_data = np.zeros((4, str_len))
    st = s[1].replace('\n', '')
    for i in st:
        if i == 'A':
            str_data[0, counter] = 1
        elif i == 'C':
            str_data[1, counter] = 1
        elif i == 'G':
            str_data[2, counter] = 1
        elif i == 'T':
            str_data[3, counter] = 1
        counter += 1
    profile = profile + str_data
consensus = ''
position = 0
while position < str_len:
    column = profile[:, position]
    # argmax returns the first maximum, so ties go to the earlier letter
    consensus = consensus + nucleotides[int(column.argmax())]
    position += 1
print consensus
for row in range(4):
    print nucleotides[row] + ': ' + ' '.join(str(int(v)) for v in profile[row])
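
For comparison, here is a sketch of the same computation using only the standard library (Python 3), reading the strings column by column. The function name profile_and_consensus is an assumption made for this illustration, and the input is taken to be a list of equal-length strings that have already been parsed.

```python
from collections import Counter

def profile_and_consensus(strings):
    # zip(*strings) walks the strings column by column
    columns = [Counter(col) for col in zip(*strings)]
    profile = {base: [c[base] for c in columns] for base in 'ACGT'}
    # max() keeps the first maximum, so ties go to the earlier letter in A, C, G, T
    consensus = ''.join(max('ACGT', key=lambda b: c[b]) for c in columns)
    return consensus, profile

sample = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC',
          'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
consensus, profile = profile_and_consensus(sample)
print(consensus)
for base in 'ACGT':
    print(base + ': ' + ' '.join(str(v) for v in profile[base]))
```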

Friday, November 18, 2016

Translating RNA into Protein (ROSALIND PROT)

Given: An RNA string s corresponding to a strand of mRNA (of length at most 10 kbp).
Return: The protein string encoded by s.


import sys

def codon_table():
    # the 61 coding codons; the stop codons UAA, UAG, UGA are handled in main()
    nukaa = {
        'UUU': 'F', 'CUU': 'L', 'AUU': 'I', 'GUU': 'V',
        'UUC': 'F', 'CUC': 'L', 'AUC': 'I', 'GUC': 'V',
        'UUA': 'L', 'CUA': 'L', 'AUA': 'I', 'GUA': 'V',
        'UUG': 'L', 'CUG': 'L', 'AUG': 'M', 'GUG': 'V',
        'UCU': 'S', 'CCU': 'P', 'ACU': 'T', 'GCU': 'A',
        'UCC': 'S', 'CCC': 'P', 'ACC': 'T', 'GCC': 'A',
        'UCA': 'S', 'CCA': 'P', 'ACA': 'T', 'GCA': 'A',
        'UCG': 'S', 'CCG': 'P', 'ACG': 'T', 'GCG': 'A',
        'UAU': 'Y', 'CAU': 'H', 'AAU': 'N', 'GAU': 'D',
        'UAC': 'Y', 'CAC': 'H', 'AAC': 'N', 'GAC': 'D',
        'CAA': 'Q', 'AAA': 'K', 'GAA': 'E',
        'CAG': 'Q', 'AAG': 'K', 'GAG': 'E',
        'UGU': 'C', 'CGU': 'R', 'AGU': 'S', 'GGU': 'G',
        'UGC': 'C', 'CGC': 'R', 'AGC': 'S', 'GGC': 'G',
        'CGA': 'R', 'AGA': 'R', 'GGA': 'G',
        'UGG': 'W', 'CGG': 'R', 'AGG': 'R', 'GGG': 'G',
    }
    return nukaa

def main():
    if len(sys.argv) > 1:
        rna_string = sys.argv[1]
        nukaa = codon_table()
        nucl_str_len = len(rna_string) // 3
        nucl_string = ''
        i = 0
        while i < nucl_str_len:
            codon = rna_string[3*i:3*(i+1)]
            # translation ends at the first stop codon
            if codon == 'UAA' or codon == 'UAG' or codon == 'UGA':
                break
            else:
                nucl_string += nukaa[codon]
            i += 1
        print nucl_string
    else:
        print 'Enter RNA sequence.'

if __name__ == '__main__':
    main()
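
The codon table does not have to be typed by hand. A sketch for Python 3, assuming the usual U, C, A, G ordering of the standard genetic code with '*' marking stop codons; the names BASES, AMINO_ACIDS, CODON_TABLE and translate are made up for this illustration.

```python
from itertools import product

BASES = 'UCAG'
# amino acids for UUU, UUC, UUA, UUG, UCU, ... in U, C, A, G order; '*' is a stop codon
AMINO_ACIDS = ('FFLLSSSSYY**CC*W'
               'LLLLPPPPHHQQRRRR'
               'IIIMTTTTNNKKSSRR'
               'VVVVAAAADDEEGGGG')
CODON_TABLE = {''.join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def translate(rna):
    protein = []
    # step through complete codons only; a trailing partial codon is ignored
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)

print(translate('AUGGCCUGA'))
```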

Finding a Motif in DNA (ROSALIND SUBS)

Given: Two DNA strings s and t (each of length at most 1 kbp).
Return: All locations of t as a substring of s.

s = 'TCACTCGATTGGAACCGA'
t = 'CACTCGACA'
occurrences = []
# start at -1 so that the first search begins at index 0
start_position = -1
while True:
    # searching from the previous hit + 1 also finds overlapping matches
    start_position = s.find(t, start_position+1)
    if start_position == -1:
        break
    # Rosalind positions are 1-based
    occurrences.append(start_position+1)
res = ''
for i in occurrences:
    res += str(i) + ' '
print res
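
Wrapped in a function, the search is easier to check on small inputs, including a match at the very first position and overlapping matches. A minimal sketch for Python 3; the name find_all is an assumption made for this illustration.

```python
def find_all(s, t):
    """1-based start positions of every (possibly overlapping) occurrence of t in s."""
    positions = []
    start = s.find(t)
    while start != -1:
        positions.append(start + 1)
        # continue one character further so overlapping matches are not skipped
        start = s.find(t, start + 1)
    return positions

print(' '.join(str(p) for p in find_all('GATATATGCATATACTT', 'ATAT')))
```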

Wednesday, November 16, 2016

Mendel's First Law (ROSALIND IPRB)

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.
Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

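import sys

def main():
    if len(sys.argv) > 3:
        k = int(sys.argv[1])
        m = int(sys.argv[2])
        n = int(sys.argv[3])
        T = k + m + n
        # number of ordered pairs of distinct organisms
        Z = T*(T - 1)
        # subtract the probability of a homozygous recessive child:
        # aa x aa always, Aa x Aa in 1/4 of cases, Aa x aa (either order) in 1/2
        print round(1-(n*(n-1) + 0.25*m*(m-1) + m*n)/Z, 5)
    else:
        print 'Enter k, m, n.'

if __name__ == '__main__':
    main()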
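
One way to double-check the closed-form expression is to enumerate every ordered pair of parents and average. A brute-force sketch for Python 3 with exact fractions; the function name dominant_probability is made up for this illustration, and it assumes the population has at least two organisms.

```python
from fractions import Fraction
from itertools import permutations

def dominant_probability(k, m, n):
    # chance that a parent passes on the recessive allele: AA -> 0, Aa -> 1/2, aa -> 1
    recessive = [Fraction(0)] * k + [Fraction(1, 2)] * m + [Fraction(1)] * n
    # ordered pairs of distinct individuals (distinct by position, not by genotype)
    pairs = list(permutations(recessive, 2))
    # the child is recessive only if both parents pass on the recessive allele
    p_recessive_child = sum(a * b for a, b in pairs) / len(pairs)
    return 1 - p_recessive_child

print(round(float(dominant_probability(2, 2, 2)), 5))
```

This is quadratic in the population size, so it is meant as a check on the formula, not a replacement for it.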

Counting Point Mutations (ROSALIND HAMM)

Given: Two DNA strings s and t of equal length (not exceeding 1 kbp).
Return: The Hamming distance dH(s,t).

import sys

def main():
    if len(sys.argv) > 2:
        s1 = sys.argv[1]
        s2 = sys.argv[2]
        diff_counter = 0
        i = 0
        # compare the strings position by position
        while i < len(s1):
            if s1[i] != s2[i]:
                diff_counter += 1
            i += 1
        print diff_counter
    else:
        print 'Enter 2 strings.'

if __name__ == '__main__':
    main()
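
The same count can be written with zip, which pairs up the characters at equal positions. A short sketch for Python 3; the name hamming and the explicit length check are additions for this illustration.

```python
def hamming(s, t):
    if len(s) != len(t):
        raise ValueError('strings must have equal length')
    # True counts as 1 when summed
    return sum(a != b for a, b in zip(s, t))

print(hamming('GAGCCTACTAACGGGAT', 'CATCGTAATGACGGCCT'))
```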

Computing GC Content (ROSALIND GC)

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

It was much easier to solve this with regular expressions, with no loops.
It doesn't work without from __future__ import division because I have Python 2.7, where dividing one integer by another discards the fractional part.

from __future__ import division
import re

def cg_content(s):
    # newlines inside s are ignored because only letters are counted
    CG = len(re.findall(r'C|G', s))
    AT = len(re.findall(r'A|T', s))
    ACGT = CG + AT
    return round(100*CG/ACGT, 6)

with open('6.txt', 'r') as f:
    data = f.read() + '\n'  # make sure the last sequence line ends with a newline
strings = re.findall(r'(Rosalind_[0-9]+)\n(([A-Z]+\n)+)', data)
max_cg_content = {}
max_cg_content['id_string'] = ''
max_cg_content['string'] = ''
max_cg_content['cg_content'] = 0
for s in strings:
    string_cg_content = cg_content(s[1])
    if string_cg_content > max_cg_content['cg_content']:
        max_cg_content['id_string'] = s[0]
        max_cg_content['string'] = s[1]
        max_cg_content['cg_content'] = string_cg_content
print max_cg_content['id_string']
print max_cg_content['cg_content']
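
For contrast with the regular-expression approach, here is a sketch that parses FASTA line by line (Python 3). The names parse_fasta, gc_content and highest_gc are assumptions made for this illustration. Unlike the script above, it divides by the full sequence length, which gives the same result only when the sequence contains nothing but A, C, G and T.

```python
def parse_fasta(text):
    """Map each record ID (the header line without '>') to its joined sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('>'):
            name = line[1:]
            records[name] = ''
        elif line and name is not None:
            records[name] += line
    return records

def gc_content(seq):
    # percentage of G and C; assumes seq is non-empty
    return 100.0 * (seq.count('G') + seq.count('C')) / len(seq)

def highest_gc(text):
    records = parse_fasta(text)
    best = max(records, key=lambda name: gc_content(records[name]))
    return best, gc_content(records[best])

best_id, best_value = highest_gc('>a\nATAT\n>b\nGCGA\nCC')
print(best_id)
print(round(best_value, 6))
```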

Sunday, November 13, 2016

ROSALIND FIBD

Given: Positive integers n≤100 and m≤20.
Return: The total number of pairs of rabbits that will remain after the n-th month if all rabbits live for m months.

import sys

def F(n, m):
    if n < 2:
        return 1
    else:
        # both lists are indexed by month; index 0 is padding
        Old = [0, 0, 1]
        New = [0, 1, 0]
        month = 3
        while month <= n:
            born_to_die = month - m - 1
            if born_to_die < 0:
                Died = 0
            else:
                Died = New[born_to_die]
            Old_add = Old[month - 1] + New[month - 1] - Died
            New_add = Old[month - 1] - Died
            Old.append(Old_add)
            New.append(New_add)
            month += 1
        # for n <= m nobody has died yet; without this check a negative
        # n - m would silently index New from the end of the list
        if n - m > 0:
            return Old[n] + New[n] - New[n - m]
        return Old[n] + New[n]

def main():
    if len(sys.argv) > 2:
        print F(int(sys.argv[1]), int(sys.argv[2]))
    else:
        print 'Enter n and m.'

if __name__ == '__main__':
    main()
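
Another way to look at the problem is to keep one counter per age and shift the counters every month. This is a different bookkeeping from the Old/New lists above, sketched for Python 3; the name mortal_rabbits is made up for this illustration.

```python
def mortal_rabbits(n, m):
    # ages[i] = number of pairs that are i months old; month 1 starts with one newborn pair
    ages = [1] + [0] * (m - 1)
    for _ in range(n - 1):
        newborns = sum(ages[1:])        # every pair at least one month old reproduces
        ages = [newborns] + ages[:-1]   # everyone ages by a month; the oldest bucket dies
    return sum(ages)

print(mortal_rabbits(6, 3))
```

Because the list always has m entries, a pair drops out exactly m months after it was born, and no special case is needed for n smaller than m.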

Saturday, November 12, 2016

ROSALIND FIB

Given: Positive integers n≤40 and k≤5.
Return: The total number of rabbit pairs that will be present after n months, if we begin with 1 pair and in each generation, every pair of reproduction-age rabbits produces a litter of k rabbit pairs (instead of only 1 pair).

import sys

def F(n, k):
    # F(1) = F(2) = 1 and F(i) = F(i-1) + k*F(i-2) for i >= 3
    Fn_2 = 1
    Fn_1 = 1
    i = 3
    while i <= n:
        Fn = Fn_1 + k*Fn_2
        Fn_2 = Fn_1
        Fn_1 = Fn
        i += 1
    return Fn_1

def main():
    if len(sys.argv) > 2:
        print F(int(sys.argv[1]), int(sys.argv[2]))
    else:
        print 'Enter n and k.'

if __name__ == '__main__':
    main()

Thursday, November 10, 2016

ROSALIND REVC

Given: A DNA string s of length at most 1000 bp.
Return: The reverse complement of s.

import sys

def rev_comp(s):
    res = ''
    # build the complement letter by letter
    for i in s:
        if i == 'A':
            res += 'T'
        elif i == 'G':
            res += 'C'
        elif i == 'T':
            res += 'A'
        elif i == 'C':
            res += 'G'
    # then reverse it
    return res[::-1]

def main():
    if len(sys.argv) > 1:
        print rev_comp(sys.argv[1])
    else:
        print 'Enter your sequence!'

if __name__ == '__main__':
    main()
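
The if/elif chain can also be replaced by a translation table. A sketch that works on Python 3 only, because str.maketrans does not exist on Python 2.7; the names COMPLEMENT and reverse_complement are made up for this illustration.

```python
# maps each nucleotide to its complement: A<->T, C<->G
COMPLEMENT = str.maketrans('ACGT', 'TGCA')

def reverse_complement(s):
    # complement every letter, then reverse the result
    return s.translate(COMPLEMENT)[::-1]

print(reverse_complement('AAAACCCGGT'))
```

Unlike the loop version, characters other than A, C, G and T are kept as they are instead of being dropped.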

ROSALIND RNA

Given: A DNA string t having length at most 1000 nt.
Return: The transcribed RNA string of t.

import sys
import re

def main():
    if len(sys.argv) > 1:
        # transcription: every T becomes U
        print re.sub(r'T', r'U', sys.argv[1])
    else:
        print 'Enter your sequence!'

if __name__ == '__main__':
    main()

ROSALIND DNA

Given: A DNA string s of length at most 1000 nt.
Return: Four integers (separated by spaces) counting the respective number of times that the symbols 'A', 'C', 'G', and 'T' occur in s.

I even wrote two functions to count the letters:

import sys
import re

def cycle(s):
    # count with an explicit loop
    a_num = 0
    t_num = 0
    g_num = 0
    c_num = 0
    for i in s:
        if i == 'A':
            a_num += 1
        elif i == 'T':
            t_num += 1
        elif i == 'G':
            g_num += 1
        elif i == 'C':
            c_num += 1
    return str(a_num) + ' ' + str(c_num) + ' ' + str(g_num) + ' ' + str(t_num)

def reg_expr(s):
    # count with regular expressions
    A = re.findall(r'A', s)
    C = re.findall(r'C', s)
    G = re.findall(r'G', s)
    T = re.findall(r'T', s)
    return str(len(A)) + ' ' + str(len(C)) + ' ' + str(len(G)) + ' ' + str(len(T))

def main():
    if len(sys.argv) > 1:
        print reg_expr(sys.argv[1])   #cycle(sys.argv[1])
    else:
        print 'Enter your sequence!'

if __name__ == '__main__':
    main()
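
A third option would be collections.Counter from the standard library. A short sketch for Python 3; the name count_nucleotides is made up for this illustration.

```python
from collections import Counter

def count_nucleotides(s):
    counts = Counter(s)
    # a Counter returns 0 for letters that never occur
    return ' '.join(str(counts[base]) for base in 'ACGT')

print(count_nucleotides('AACGTT'))
```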

Entertaining sites about bioinformatics

Notes on genetics, with a historical slant.

For learning Python, I liked Google's course.

On horizontal gene transfer.

Mitotic crossover.

Sites for a bioinformatician

Genome information: NCBI

Biomedical articles: PubMed

Phred 33 and Phred 64 scales
