字符串匹配,允许一个字符不匹配

lood339 2012-09-01 04:10:46

Oh!! Mankind is in trouble again.This time its a deadly disease spreading at a rate never seen before. Efficient detectors for the virus responsible is the need of the hour. You are the lead at Central Hospital and need to find a fast and reliable way to detect the 'foot-prints' of the virus DNA in that of patient.

The DNA of the patient as well as of the virus consists of lower case letters. Since the data collected is raw there may be some errors. You will need to find all substrings in the patient DNA that exactly matches the virus DNA with the exception of at most one mismatch.

For example tolerating at most one mismatch, "aa" and "aa" are matching, "ab" and "aa" are matching, while "ab" and "ba" are not.

Input:
The first line contains the number of test cases T. T cases follow. Each case contains two lines containing strings P(Patient DNA) and V(Virus DNA) . Each case is followed by a blank line.

Output:
Output T lines, one corresponding to each case. For each case, output a space delimited list of starting indices (0 indexed) of substrings of P which are matching with V according to the condition mentioned above . The indices has to be in increasing order.

Constraints:
1 <= T <= 10
P and V contain at most 100000 characters each.
All characters in P and V are lowercase letters.

Sample Input:
3
abbab
ba

hello

world

banana

nan

Sample Output:
1 2

0 2

Explanation:
For the first case, the substrings of P starting at indices 1 and 2 are "bb" and "ba" and they are matching with the string V which is "ba".
For the second case, there are no matching substrings so the output is a blank line.

题目大意：
给定两个字符串s1,s2。在s1中找到s2（最多允许一个字符不相同）。
相同字符串匹配可以用kmp,所以，我的思路是改变一下kmp。
1，如果字符串完全相同，直接用kmp。
2，如果有一个字符串不相同，那么在第一个不相同的地方做标记,p_i, p_j，然后，继续比较下去，直到第二个不匹配的字符或者后面的字符全部相符。
前面两种情况在一次匹配完后，都要将下标(分别用i，j表示)移动。
情况1，j = pat[j-1], i = i+1
情况2, i, j = p_i, p_j #先移到第一次不匹配的地方
while j > 0 and tmp[j] != src[i]:
j = pat[j-1] #类似kmp方法将j尽量向前移动
这是一次匹配的情况，每次匹配都返回当前下标i,j,开始下标，是否匹配四个参数

代码：

def kmp_pattern(seq):

    P = [0]*len(seq)

    j = 0

    for i in range(1, len(seq)):

        while j > 0 and seq[j] != seq[i]:

            j = P[j-1]

        if seq[j] == seq[i]:

            j += 1

        P[i] = j

    return P



def kmp_match(src, tmp, pat):

    ret = []

    j = 0

    for i in range(len(src)):

        while j > 0 and tmp[j] != src[i]:

            j = pat[j-1]

        if tmp[j] == src[i]:

            j += 1

        if j == len(tmp):

            ret.append(i-len(tmp)+1)

            j = pat[j-1]

    return ret



def kmp_mismatch_once(src, tmp, pat, i, j):

    p_i, p_j = i, j

    state = 0

    s_len, t_len = len(src), len(tmp)

    while i<len(src):

        if j<t_len and state == 0:

            if src[i] == tmp[j]:

                j += 1

            else:                

                state = 1

                p_i, p_j = i, j

                j += 1

        elif j<t_len and state == 1:

            if src[i] == tmp[j]:

                j += 1

            else:

                #not match

                i, j = p_i, p_j

                while j>0 and tmp[j] != src[i]:

                    j = pat[j-1]

                return i, j, i-j, False



        if j == len(tmp):

            ret = i - t_len + 1

            if state == 0:

                j = pat[j-1]

                return i+1, j, ret, True

            elif state == 1:

                i, j = p_i, p_j

                while j>0 and tmp[j] != src[i]:

                    j = pat[j-1]

                return i, j, ret, True               

        i += 1

    return i, j, 0, False    





def kmp_one_mismatch(src, tmp, pat):

    pre_i, pre_j, i, j = 0, 0, 0, 0

    pre_idx = -1

    idx, ret2 = 0, False

    ret = []

    while i<len(src):

        i, j, idx, ret2 =  kmp_mismatch_once(src, tmp, pat, i, j)

        if pre_i == i and pre_j == j and idx == pre_idx: #repeat

            i += 1

        else:

            pre_i, pre_j = i, j

            pre_idx = idx

            if ret2 == True:

                ret.append(idx)                             

    return ret  



sys.stdin=open( 'input00.txt', 'r')

T = int(raw_input())

while T:

    T -= 1

    src = raw_input()

    tmp = raw_input()

    if T != 0:

        emp = raw_input()

    src = list(src)

    tmp = list(tmp)



    pat = kmp_pattern(tmp)

    idces = kmp_one_mismatch(src, tmp, pat)

    for v in idces:

        print(v),

    print('')

提交的结果是wrong error。这个问题前前后后搞了我2周了。我怀疑我的逻辑上有问题。
请帮助一下，谢谢。

...全文

801 14 打赏收藏转发到动态举报

写回复

用AI写文章

14 条回复

切换为时间正序

请发表友善的回复…

发表回复

李富生 2012-12-06

打赏
举报

不可以像牛耕田一样来做吗？

取得查找字串长度，然后在被查找的字串截取同长度的字串出来，符合条件存入匹配数组，下移一位，再对比，直到最后，不是匹配条件的都出来了么？为何要想得如此复杂呢？

lood339 2012-12-06

打赏
举报

这个问题没有被解决。

Prairial_0 2012-11-06

打赏
举报

引用 11 楼 heavyok 的回复:

引用 7 楼 Prairial_0 的回复:目测可以用NFA，求链接详细说说算法？复杂度为多少？

和9L做法差不多，如果javascript的正则表达式是转化为自动机来处理的话个人理解本质上kmp就是在跑一个nfa。用nfa理解可能简单一点，只是个思路而已，复杂度我也不知道。

heavyok 2012-11-06

打赏
举报

引用 7 楼 Prairial_0 的回复:

目测可以用NFA，求链接

详细说说算法？复杂度为多少？

heavyok 2012-11-06

打赏
举报

引用 8 楼 aihahaheihei 的回复:

哈希可行，模式串分解成 26* 10w 种，可以线性求出所有的 hash key, 同样可以线性求出源串所有长度等于模式串长度的子串的 hash key 。复杂度大概也就 260w .

试了，你说的算法，其中一些test case 还是跑不过，超时了。分析下，因为Hash可能存在冲突，因此每次Hash值相同时还是需要检测一次是否匹配。这样情况下，如果冲突很多，或者匹配的位置很多（例如每一位都匹配），那么复杂度变为O(M*N)。

prototyper 2012-11-06

打赏
举报

感觉正则也是一种算法——逻辑算法：


String.prototype.findSimilar = function (str, sym) {
    var len = str.length, sym = sym || ","; 
    var reg = "(?=.?["+str+"]{"+[len-1]+"})[^"+sym+"]{"+len+"}(?="+sym+")";
    return this.match(new RegExp(reg, "g"));
}
alert(s1.findSimilar(s2, symbol));//symbol为s1中分隔符，缺省为“,”

DeDeWo 2012-11-05

打赏
举报

哈希可行，模式串分解成 26* 10w 种，可以线性求出所有的 hash key, 同样可以线性求出源串所有长度等于模式串长度的子串的 hash key 。复杂度大概也就 260w .

Prairial_0 2012-11-05

打赏
举报

目测可以用NFA，求链接

heavyok 2012-11-05

打赏
举报

hi，你这个问题解决了么？我最近也在做这个题呵呵。这个问题能用类似KMP的方法解决么？我觉得i, j都回溯的方法不行，举个例子： 0 1 2 3 4 5 6 母串： a b b b f f g 子串： a b b f f g 当子串中最后的g与母串f不匹配时，我们需要回溯到3，因为3是第一个不匹配的地方。由子串3的f得到next值应该是0，但是显然，如果让0（子串开始的a）和3位置匹配，就丢了（b b b f f g）这个匹配。你想到好的方法了么？

施工中请绕行 2012-09-06

打赏
举报

试试用NFA或者DFA解决

天下第一好大人 2012-09-06

打赏
举报

[Quote=引用 2 楼的回复:]

换我做的话我会用rabin karp，配上int64的hash。事先把匹配串的所有1mismatch的hash给算出来。一共也就26*m的hash数，而且能线性的求出所有的hash。排个序以便二分搜索。接下来就是算原串所有长度m的子串，然后看是不是在刚才求出的列表里。整个可以做到nlogn。

kmp我一直不算掌握的很好，所以你这个对不对我还真说不上- -b
[/Quote]

要是10个case的P和V都是最长的长度，你这个方法估计得TLE。

ghbxx2004 2012-09-06

打赏
举报

用KMP，稍微变一下型就行了

FancyMouse 2012-09-04

打赏
举报

换我做的话我会用rabin karp，配上int64的hash。事先把匹配串的所有1mismatch的hash给算出来。一共也就26*m的hash数，而且能线性的求出所有的hash。排个序以便二分搜索。接下来就是算原串所有长度m的子串，然后看是不是在刚才求出的列表里。整个可以做到nlogn。

kmp我一直不算掌握的很好，所以你这个对不对我还真说不上- -b