Z algorithm is a linear time string matching algorithm which runs in complexity. It is used to find all occurrence of a pattern in a string , which is common string searching problem.
Algorithm
Given a string of length , the Z Algorithm produces an array where is the length of the longest substring starting from which is also a prefix of , i.e. the maximum such that for all . Note that means that . For easier terminology, let's refer to substrings which are also a prefix as prefix-substrings.
The algorithm relies on a single, crucial invariant. Iterate over the letters in the string (index from to ) and maintain an interval which is the interval with maximum such that and is a prefix-substring (if no such interval exists, just let ). For , simply compute and by comparing to . is obtained during this process.
Now, suppose the correct interval for and all of the values up to . Compute and the new by the following steps:
- If , then there does not exist a prefix-substring of that starts before and ends at or after . If such a substring existed, would have been the interval for that substring rather than its current value. Thus "reset" and compute a new by comparing to and get at the same time ().
- Otherwise, , so the current extends at least to . Let . It is known that because matches for at least characters (they are in the interval which is known to be a prefix-substring).
- If , then there is no longer prefix-substring starting at (or else would be larger), meaning and stays the same. The latter is true because only changes if there is a prefix-substring starting at that extends beyond , which is not the case here.
- If , then it is possible for to match for more than characters (i.e. past position ). Thus, there's a need to update by setting and matching from forward to obtain the new . Again, is obtained during this process.
The process computes all of the values in a single pass over the string, so we're done. Correctness is inherent in the algorithm and is pretty intuitively clear.
Time Complexity
The algorithm runs in time. Characters are never compared at positions less than , and every time a match is found, is increased by one, so there are at most comparisons.
Implementation
Note that the optimization is used when (it doesn't affect the algorithm since at the next iteration regardless).
int L = 0, R = 0;
for (int i = 1; i < n; i++)
{
if (i > R)
{
L = R = i;
while (R < n && s[R-L] == s[R])
{
R++;
}
z[i] = R-L;
R--;
}
else
{
int k = i-L;
if (z[k] < R-i+1)
{
z[i] = z[k];
}
else
{
L = i;
while (R < n && s[R-L] == s[R])
{
R++;
}
z[i] = R-L;
R--;
}
}
}
Applications
One application of the Z Algorithm is for the standard string matching problem of finding matches for a pattern of length in a string of length . This can be done in time by using the Z Algorithm on the string (that is, concatenating , , and ) where is a character that matches nothing. The indices with correspond to matches of in .
Không có nhận xét nào:
Đăng nhận xét