Motivation Problem: Given a string , find the longest sub string that occurs at least  times.
Brute Force method: For every sub string  of , one can find all the occurrences of  in  by . takes  time, so the total time for this brute force method will be .
A faster solution using hashing: We can binary search the length of the sub string. For a current length  in the binary search, hash of every sub string of length  can be found in  time. While doing this, the hashes can be stored in a dictionary, and when all sub strings of length  are processed, the hash with maximum frequency is to be checked if it has frequency greater than equal to . This takes  time, where a log term comes due to maintaining the dictionary(map in C++).
A solution using Suffix Array:
A Suffix Array is a sorted array of suffixes of a string. Only the indices of suffixes are stored in the string instead of whole strings. For example: Suffix Array of "banana" would look like this:
One naive way to make the suffix array would be to store all suffixes in an array and sort them. If we use an  comparison based sorting algorithm, then the total time to make the suffix array would be , because string comparison takes  time. This is too slow for large strings.
Below is shown an  algorithm that constructs the suffix array. There is an  algorithm and even an  algorithm to construct suffix array, but in a programming contest environment, it is much easier to implement an  algorithm. Also the difference between an  and  algorithm is scarcely noticeable for strings up to length .
The algorithm is based on keeping the ranks of suffixes when the suffixes are sorted by their first  characters in the  step. Therefore we will execute  steps to completely build the suffix array.
It can be easily seen that, comparison of  strings should be optimised, and should be done in better than . It can actually be done and the string comparison of  suffixes can be done in  time. To do this, the fact that  suffixes of the same string are being sorted should be used.
Now suppose that an order relation between the suffixes has been obtained when they are sorted by their first characters. That is,  steps of the algorithm have been done. Now to obtain the order relation in  step, best possible use of order relations in previous steps must be done. Now in the  step, suppose comparison of  suffixes at indices  and  needs to be done. Let us denote the rank of  suffix after  steps by .
Observation: A string of length  can be broken down into  strings of length . If , then  and we know the relation. Else if , then again we know the relation between them. If , then we can obtain the relation between  and  by comparing  and , because the first  characters of the suffixes starting at indices  and  are same as  = . If  and  are also same, then we assign the same rank to both the suffixes.
Therefore at step , to compare  suffixes in  time, a tuple of  integers can be stored for each suffix. Let us name the suffix  and its index be . First integer of tuple that will be stored for  would be , that is the rank of  when it was sorted by first  characters. Second integer of tuple that will be stored would be , that is the rank of suffix starting at index , when it was sorted by the first  characters. This tuple is enough to compare  suffixes in  time as shown above.
It might be possible that  exceeds the string length. In that case some negative number can be assigned to the second integer of tuple of , so that lexicographic order can be maintained. The importance of assigning a negative number to the second integer of tuple can be understood as follows: Let there be  suffixes that are ranked same according to their first  characters and let length of first suffix be greater or equal to  and let length of second suffix be less than . As the rank of these suffixes is same according to their first characters, second suffix should surely come before the first suffix in lexicographical ordering because it is of lesser length. Therefore assigning a negative number to the second integer of tuple can help here.
Here is some pseudo code to construct suffix array.
SA = [] // Suffix Array
P = [][] // P[i][j] denotes rank of suffix at position 'j' when all suffixes are sorted by their first '2^i' characters
str = [] // initial string, 1 based indexing
POWER = [] //array of powers of 2, POWER[i] denotes 2^i
tuple {
    first, second, index;
}
L = [] // Array of Tuples
N = length of str
for i = 1 to N:
    P[0][i] = str[i] - 'a' // Give initial rank when suffixes are sorted by their first 2^0 = 1 character.
step = 1
for i = 1; POWER[i-1]<N; i++, step++:
    for j = 1 to N:
        L[j].index = j
        L[j].first = P[i-1][j]
        L[j].second = (j+POWER[i-1]<=n ? P[i-1][j+POWER[i-1]] : -1)
    sort(L)
    for j = 1 to N:
        P[i][L[j].index] = ((j>1 and L[j].first==L[j-1].first and L[j].second==L[j-1].second) ? P[i][L[j-1].index] : j) 
        /*Assign same rank to suffixes which have same number in the first and second fields of their respective tuples.*/
step = step - 1
Now at the  row of matrix , we have the ranks of all suffixes. Now we can get the suffix array very easily in .
for i = 1 to N:
    SA[P[step][i]] = i
Note: Care must be taken when string length is , in that case if the string is "c", then it will get a rank of ('c'-'a') that is  because we will not enter the for loop. In this case you can manually put the rank as , that is P[0][1]=1instead of P[0][1]=str[1]-'a'.
Often it is required to find the Longest Common Prefix (LCP) of  suffixes. This can be done easily in time by using the array . The following fact is used to find the  of  suffixes starting at indices  and : If P[x][i]==P[x][j], then first  characters starting at indices  and  are same. Below is the pseudo code:
LCP(i,j): //returns the length of LCP of suffixes starting at indices i and j
    if i==j:
        return N-i+1
    return_value=0
    for x = step to 0:
        if P[x][i]==P[x][j]:
            return_value = return_value + POWER[x]
            i = i + POWER[x]
            j = j+ POWER[x]
    return return_value
Now coming to the original problem, to find the longest sub string that occurs at least  times.
First build the suffix array of string . If in the sorted array of suffixes, the  of  suffixes is , then the prefix of length  of all suffixes between these  suffixes is same. Let index of these  suffixes be  and  , then a sub string of length  repeats  times.
To find the solution to motivation problem, one can iterate through all the suffixes in sorted order from  to , and find the  of current suffix and suffix at index  greater than it. This  will repeat at least  times, and the maximum of all these  can be taken. Time complexity: .
Pseudo code:
build Suffix Array
for i = 1 to (N-M+1):
    ans=max(ans, LCP(SA[i],SA[i+M-1]))
Không có nhận xét nào:
Đăng nhận xét