понедельник, 20 февраля 2017 г.

Find a substring

t = '010's = '0111010100001'print(s.find(t)) - first position

Common problem: we have string s (usually very long) and string t (usually short). Find positions, where t is a substring of s.
Example: for s = 'TATGCAATGC' and t = 'TGC' positions are: 2, 7.

Trere are several algorithms to find them: wikipedia.
I'll show here two easiest ways:
naive algorithm (find all pos):
def substring_positions(s,t):
    res = []
    for i in range(len(s) - len(t) + 1):
        flag = True        for j in range(len(t)):
            if s[i + j] != t[j]:
                flag = False                break        if flag:
            res.append(str(i))
    return res

knuth-morris-pratt algorithm:
def kmp(s,t, limit = 1):
    # building prefix table pt for s    # it's i-th element shows, how many steps go backward on s    # if there will be a mismatch with t string    pt = [0]
    for m in s[1:]:
        j = pt[-1]
        while j > 0 and m != s[j]:
            j = pt[j-1]
        if m == s[j]:
            j += 1        pt.append(j)
    # walk through the s    res = []
    j = 0 # number or symbols matched with t    for i in range(len(s)):
        while j > 0 and s[i] != t[j]:
            j = pt[j - 1]
        if s[i] == t[j]:
            j += 1        if j == len(t):
            res.append(i - j + 1)
            if len(res) == limit:
                return res
            j = 0    return res
 

Комментариев нет:

Отправить комментарий