String Searching by KMP algorithm (Knuth Morris Pratt algorithm)
Motivation Problem: Given strings (pattern) and (text), find the number of occurrences of in .
Basic / Brute Force solution:
One obvious and easy to code solution that comes to mind is this: For each index of , take it as a starting point and find if is equal to .
for i = 0 to length(T)-length(P):
Found = true
for j = i to i + length(P) - 1:
if P[j-i] not equal to T[j]:
Found = False
if Found = True
answer = answer + 1
This brute force takes time in the worst case, which is obviously too slow for large strings.
Knuth Morris Pratt Algorithm:
Suppose for each index of some string , the longest suffix in that is also a prefix of , be known. Formally, a length is known such that = . Let these lengths be stored in array . The suffix needs to be proper(whole string is not a proper suffix).
Then the solution to the motivation problem can be found as follows:
Define a string , where is a delimiter that is not present in either of or . Now, if the above information is known, all occurrences of in can be found as follows: If at some index , , then there is an occurrence of Pattern at position . All such indices from [0 based indexing, the index just after '#'], need to be checked.
The main part of KMP algorithm calculates the array , which is also called the prefix function. If calculation of or the prefix function can be done efficiently, then we have an efficient solution to the motivation problem. KMP algorithm finds the prefix function in time.
To find the prefix function, best possible use of previous values of array is made, so that calculations aren't done again and again. Suppose all have been calculated, and now is to be calculated. It is to be noted that, value of can be at most 1 greater than . Here is a proof by contradiction:
Suppose . Now, if the character is removed, we obtain a suffix ending at index that is of length , which is greater than . This is a contradiction, hence proved.
Observe that if , then the value of . If not, a smaller suffix ending at index is to be found, that is also a prefix of . Let the length of such a suffix be , then if then . If again the equality doesn't hold true, smaller and smaller suffixes that end at index , which are also prefixes of need to be found.
The only thing remaining is, how to find the length of next smaller suffix ending at index , that is also a prefix? This is also pretty simple. Observe that due to the property of , the segment is equal to the segment . So to find the next smaller suffix ending at index , the longest suffix ending at can be found which is , and this suffix will be the next smaller suffix ending at index . If this suffix also doesn't satisfy our criteria, then smaller suffixes can be found with the same process, here it will be . Note that, if at some point the length becomes , the process is stopped.
This completes algorithm. Below is the code:
vector<int> prefix_function (string Z) {
int n = (int) Z.length();
vector<int> F (n);
F[0]=0;
for (int i=1; i<n; ++i) {
int j = F[i-1];
while (j > 0 && Z[i] != Z[j])
j = F[j-1];
if (Z[i] == Z[j]) ++j;
F[i] = j;
}
return F;
}
Finding the array for "ABCABC"
Initially, .
Index , does not go into while loop and , therefore value of .
Index , does not go into while loop and , therefore value of .
Index , does not go into while loop and , therefore value of .
Index , satisfies while loop condition but , hence does not go into while loop, therefore value of .
Index , satisfies while loop condition but , therefore value of .
Không có nhận xét nào:
Đăng nhận xét