Thứ Bảy, 5 tháng 1, 2019

Z Algorithm

Z algorithm is a linear time string matching algorithm which runs in O(n) complexity. It is used to find all occurrence of a pattern P in a string S, which is common string searching problem.
Algorithm
Given a string S of length n, the Z Algorithm produces an array Z where Z[i] is the length of the longest substring starting from S[i] which is also a prefix of S, i.e. the maximum k such that S[j]=S[i+j] for all 0j<k. Note that Z[i]=0 means that S[0]S[i]. For easier terminology, let's refer to substrings which are also a prefix as prefix-substrings.
The algorithm relies on a single, crucial invariant. Iterate over the letters in the string (index i from 1 to n1) and maintain an interval [L,R] which is the interval with maximum R such that 1LiR and S[L...R] is a prefix-substring (if no such interval exists, just let L=R=1). For i=1, simply compute L and R by comparing S[0...] to S[1...]Z[1] is obtained during this process.
Now, suppose the correct interval [L,R] for i1 and all of the Z values up to i1. Compute Z[i] and the new [L,R] by the following steps:
  • If i>R, then there does not exist a prefix-substring of S that starts before i and ends at or after i. If such a substring existed, [L,R] would have been the interval for that substring rather than its current value. Thus "reset" and compute a new [L,R] by comparing S[0...] to S[i...] and get Z[i] at the same time (Z[i]=RL+1).
  • Otherwise, iR, so the current [L,R] extends at least to i. Let k=iL. It is known that Z[i]min(Z[k],Ri+1) because S[i...] matches S[k...] for at least Ri+1 characters (they are in the [L,R] interval which is known to be a prefix-substring).
  • If Z[k]<Ri+1, then there is no longer prefix-substring starting at S[i] (or else Z[k] would be larger), meaning Z[i]=Z[k] and [L,R] stays the same. The latter is true because [L,R] only changes if there is a prefix-substring starting at S[i] that extends beyond R, which is not the case here.
  • If Z[k]Ri+1, then it is possible for S[i...] to match S[0...] for more than Ri+1 characters (i.e. past position R). Thus, there's a need to update [L,R] by setting L=i and matching from S[R+1]forward to obtain the new R. Again, Z[i] is obtained during this process.
The process computes all of the Z values in a single pass over the string, so we're done. Correctness is inherent in the algorithm and is pretty intuitively clear.
Time Complexity
The algorithm runs in O(n) time. Characters are never compared at positions less than R, and every time a match is found, R is increased by one, so there are at most n comparisons.
Implementation
Note that the optimization L=R=i is used when S[0]S[i] (it doesn't affect the algorithm since at the next iteration i>R regardless).
int L = 0, R = 0;
for (int i = 1; i < n; i++) 
{
    if (i > R) 
    {
        L = R = i;
        while (R < n && s[R-L] == s[R]) 
        {
            R++;
        }
        z[i] = R-L; 
        R--;
    } 
    else 
    {
        int k = i-L;
        if (z[k] < R-i+1) 
        {
            z[i] = z[k];
        } 
        else 
        {
            L = i;
            while (R < n && s[R-L] == s[R]) 
            {
                R++;
            }
            z[i] = R-L; 
            R--;
        }
    }
}
Applications
One application of the Z Algorithm is for the standard string matching problem of finding matches for a pattern Tof length m in a string S of length n. This can be done in O(n+m) time by using the Z Algorithm on the string TΦS (that is, concatenating TΦ, and S) where Φ is a character that matches nothing. The indices i with Z[i]=m correspond to matches of T in S.

Không có nhận xét nào:

Đăng nhận xét

Bài G - Educatioal Round 62

Đề bài: Bạn được cho 1 đồ thị vô hướng đặc biệt. Nó bao gồm $2n$ đỉnh được đánh số từ 1 đến 2n. Dưới đây là một số đặc tính của đồ thị: + ...