Wednesday, April 6, 2016

KMP String pattern search algorithm (KMP 字符串匹配算法)

The original post is provided by: JBoxer
In this post, I will provide the explanation of KMP using both English and Chinese.

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.
Here’s the partial match table for the pattern “abababca”:
1
2
3
char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.
Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.
Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.
Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.
With this in mind, I can now give the one-sentence meaning of the values in the partial match table:
The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.
Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.
Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.
Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.
Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.
Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:
If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.
Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:
1
2
3
char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |
The first time we get a partial match is here:
1
2
3
bacbababaabcbab
 |
 abababca
This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:
1
2
3
bacbababaabcbab
    |||||
    abababca
This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:
1
2
3
4
5
// x denotes a skip

bacbababaabcbab
    xx|||
      abababca
This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:
1
2
3
4
5
// x denotes a skip

bacbababaabcbab
      xx|
        abababca
At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.
中文:
部分匹配表

毫无疑问,KMP算法的精髓是部分匹配表。我理解KMP算法时,最大的障碍就在于是否充分明白部分匹配表里的值所代表的意义。下面我会尽可能简单地来解释这些。

下面这个是“abababca”这个模板的部分匹配表:

char: | a | b | a | b | a | b | c | a |

index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |

value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

如果我有一个8个字符的模板(这里我们就用“abababca”来举例子),我的部分匹配表将会有8格。如果此时此刻我正匹配模板的第8格即最后一格,那意味着我匹配了整个模板(“abababca”);如果我正匹配模板的第7格,则意味着当前仅匹配了整个模板的前7位(“abababc”),此时第8位(“a”)是无关的,不用去管它;如果我此时此刻正匹配模板的第6格,那意味着……看到这里你应该已经明白我的意思了。目前我还没有提到部分匹配表每格数据的含义,在这里仅仅是交代了大概。

现在,为了说明刚刚提到的每格数据的含义,我们首先要明白什么是最优前缀什么是最优后缀。

最优前缀:一个字符串中,去除一个或多个尾部的字符所得的新字符串就是最优前缀。例如 “S”、 “Sn”、 “Sna”、 “Snap”都是“Snape”的最优前缀。

最优后缀:一个字符串中,去除一个或多个首部的字符所得的新字符串就是最优后缀。例如“agrid”、 “grid”、“rid”、 “id”、“d”都是 “Hagrid”的最优后缀。

有了两个概念,我现在可以用一句话来概括部分匹配表里每列数据的含义了:

模板(子模板)中,既是最优前缀也是最优后缀的最长字符串的长度。

下面我举例说明一下这句话。我们来看部分匹配表的第3格数据,如果你还记得我在前面提到的,这意味着我们目前仅仅关心前3个字母(“aba”)。在“aba”这个子模板中,有两个最优前缀(“a”和“ab”)和两个最优后缀(“a”和“ba”)。其中,最优前缀“ab”并不是最优后缀。因此,最优前缀与最优后缀中,相同的只有“a”。那么,此时此刻既是最优前缀也是最优后缀的最长字符串的长度就是1了。

我们再来试试第4格,我们应该是关注于前4个字母(“abab”)。可以看出,有3个最优前缀(“a”、“ab”、 “aba”)和3个最优后缀(“b”、“ab”、“bab”)。这一次 “ab” 既是最优前缀也是最优后缀,并且长度为2,因此,部分匹配表的第4格值为2。

这是很有趣的例子,我们再看看第5格的情况,也就是考虑“ababa”。我们有4个最优前缀(“a”、 “ab”、“aba”,和“abab”)和4个最优后缀(“a”、 “ba”、“aba”,和“baba”)。现在,有两个匹配“a”和“aba” 既是最优前缀也是最优后缀,而“aba”比“a”要长,所以部分匹配表的第5格值为3。

跳过中间的直接来看第7格,此时只考虑字母“abababc”。即使不一一枚举出所有的最优前缀与最优后缀也不难看出,这两个集合之间不会有任何的交集。因为,所有最优后缀都以“c”结尾,但没有任何最优前缀是以“c”结尾的,所以没有相匹配的,因此第7格值为0。

最后,让我们看看第8格,也就是考虑整个模板(abababca)。它的最优前缀与最有后缀都以“a”开头以“a”结尾,所以第8列的值至少是1。然而1就是最终结果了,所有长度大于等于2的最优后缀都包含“c”,但只有“abababc”这一个最优前缀包含“c”,这个7位的最优后缀“bababca”并不匹配,所以第8列最终赋值为1。


如何使用部分匹配表

当我们找到了部分匹配的字符串时,可以用部分匹配表里的值来跳过前面一些字符(而不是重复进行没有必要的比较)。具体是这样工作的:

如果已经匹配到的部分字符串的长度为partial_match_length且 table[partial_match_length] > 1,那么我们可以跳过partial_match_length- table[partial_match_length - 1]个字符。

比如,我们拿“abababca”来这个模板来匹配文本“ bacbababaabcbab”的话,我们的部分匹配表应该是这样的:

char: | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

第一次匹配的时候是在这里

bacbababaabcbab
|
abababca

partial_match_length值为1,对应的table[partial_match_length - 1] (即table[0])值为0。所以,这种情况下我们不能跳过任何字符。下一次的匹配是这里:

bacbababaabcbab
| | | | |
abababca

partial_match_length值为5,对应的 table[partial_match_length - 1] (即 table[4])值为3。这意味着我们可以跳过 partial_match_length- table[partial_match_length - 1] (即 5 – table[4] 或5 – 3 亦即 2)个字符:

// x 表示一个跳过

bacbababaabcbab
xx | | |
abababca

partial_match_length值为3,对应的 table[partial_match_length - 1] (即 table[2])值为1,这意味着我们可以跳过 partial_match_length- table[partial_match_length - 1] (即 3- table[2] 或3 – 1亦即 2)个字符:

// x 表示一个跳过

bacbababaabcbab
xx |
abababca

现在,模板长度大于所剩余的目标字符串长度,所以我们知道不会再有匹配了。

No comments:

Post a Comment