Thứ Bảy, 5 tháng 1, 2019

String Searching

String Searching by KMP algorithm (Knuth Morris Pratt algorithm)
Motivation Problem: Given 2 strings P(pattern) and T(text), find the number of occurrences of P in T.
Basic / Brute Force solution:
One obvious and easy to code solution that comes to mind is this: For each index of T, take it as a starting point and find if Ti,i+1,...,i+|P|1 is equal to P.
for i = 0 to length(T)-length(P):

    Found = true

    for j = i to i + length(P) - 1:

        if P[j-i] not equal to T[j]:

            Found = False

    if Found = True 

        answer = answer + 1
This brute force takes O(|P||T|) time in the worst case, which is obviously too slow for large strings.
Knuth Morris Pratt Algorithm:
Suppose for each index i of some string Z, the longest suffix in Z0,1,...,i that is also a prefix of Z0,1,...,i, be known. Formally, a length Fi is known such that Z0,1,...,Fi1 = ZiFi+1,...,i. Let these lengths be stored in array F. The suffix needs to be proper(whole string is not a proper suffix).
Then the solution to the motivation problem can be found as follows:
Define a string V = P + '#' + T, where '#' is a delimiter that is not present in either of P or T. Now, if the above information is known, all occurrences of P in T can be found as follows: If at some index iFi=|P|, then there is an occurrence of Pattern P at position i|P|+1. All such indices from |P|+1 [0 based indexing, the index just after '#'], need to be checked.
The main part of KMP algorithm calculates the array F, which is also called the prefix function. If calculation of For the prefix function can be done efficiently, then we have an efficient solution to the motivation problem. KMP algorithm finds the prefix function in O(lengthofString) time.
To find the prefix function, best possible use of previous values of array F is made, so that calculations aren't done again and again. Suppose all Fi have been calculated, and now Fi+1 is to be calculated. It is to be noted that, value of Fi+1 can be at most 1 greater than Fi. Here is a proof by contradiction:
Suppose Fi+1>Fi+1. Now, if the (i+1)th character is removed, we obtain a suffix ending at index i that is of length Fi+11, which is greater than Fi. This is a contradiction, hence proved.
Observe that if Zi+1=ZFi, then the value of Fi+1=Fi+1. If not, a smaller suffix ending at index i is to be found, that is also a prefix of Z0,1...,i. Let the length of such a suffix be j, then if Zi+1=Zj then Fi+1=j+1. If again the equality doesn't hold true, smaller and smaller suffixes that end at index i, which are also prefixes of Z0,1...,i need to be found.
The only thing remaining is, how to find the length of next smaller suffix ending at index i, that is also a prefix? This is also pretty simple. Observe that due to the property of F, the segment Z0,1,...,Fi1 is equal to the segment ZiFi+1,...,i. So to find the next smaller suffix ending at index i, the longest suffix ending at Fi1 can be found which is FFi1, and this suffix will be the next smaller suffix ending at index i. If this suffix also doesn't satisfy our criteria, then smaller suffixes can be found with the same process, here it will be FFFi11. Note that, if at some point the length becomes 0, the process is stopped.
This completes KMP algorithm. Below is the code:
vector<int> prefix_function (string Z) {

    int n = (int) Z.length();

    vector<int> F (n);

     F[0]=0;

    for (int i=1; i<n; ++i) {

        int j = F[i-1];

        while (j > 0 && Z[i] != Z[j])

            j = F[j-1];

        if (Z[i] == Z[j])  ++j;

        F[i] = j;

    }

    return F;

}
Finding the F array for "ABCABC"
Initially, F0=0.
Index 1F0=0j does not go into while loop and ZjZi, therefore value of Fi=0.
Index 2F1=0j does not go into while loop and ZjZi, therefore value of Fi=0.
Index 3F2=0j does not go into while loop and Zj=Zi, therefore value of Fi=1.
Index 4F3=1j satisfies while loop condition but Zj=Zi, hence does not go into while loop, therefore value of Fi=2.
Index 5F4=2j satisfies while loop condition but Zj=Zi, therefore value of Fi=3.

Không có nhận xét nào:

Đăng nhận xét

Bài G - Educatioal Round 62

Đề bài: Bạn được cho 1 đồ thị vô hướng đặc biệt. Nó bao gồm $2n$ đỉnh được đánh số từ 1 đến 2n. Dưới đây là một số đặc tính của đồ thị: + ...