Thứ Sáu, 4 tháng 1, 2019

Suffix Arrays

Motivation Problem: Given a string S, find the longest sub string that occurs at least M times.
Brute Force method: For every sub string X of S, one can find all the occurrences of X in S by KMPKMPtakes O(N) time, so the total time for this brute force method will be O(N3).
A faster solution using hashing: We can binary search the length of the sub string. For a current length X in the binary search, hash of every sub string of length X can be found in O(N) time. While doing this, the hashes can be stored in a dictionary, and when all sub strings of length X are processed, the hash with maximum frequency is to be checked if it has frequency greater than equal to M. This takes O(N(log(N))2) time, where a log term comes due to maintaining the dictionary(map in C++).
A solution using Suffix Array:
A Suffix Array is a sorted array of suffixes of a string. Only the indices of suffixes are stored in the string instead of whole strings. For example: Suffix Array of "banana" would look like this:
5 a
3 ana
1 anana
0 banana
4 na
2 nana
One naive way to make the suffix array would be to store all suffixes in an array and sort them. If we use an O(Nlog(N)) comparison based sorting algorithm, then the total time to make the suffix array would be O(N2logN), because string comparison takes O(N) time. This is too slow for large strings.
Below is shown an O(N(logN)2) algorithm that constructs the suffix array. There is an O(NlogN) algorithm and even an O(N) algorithm to construct suffix array, but in a programming contest environment, it is much easier to implement an O(N(logN)2) algorithm. Also the difference between an O(N(logN)2) and O(NlogN) algorithm is scarcely noticeable for strings up to length 105.
The algorithm is based on keeping the ranks of suffixes when the suffixes are sorted by their first 2k characters in the kth step. Therefore we will execute O(logN) steps to completely build the suffix array.
It can be easily seen that, comparison of 2 strings should be optimised, and should be done in better than O(N). It can actually be done and the string comparison of 2 suffixes can be done in O(1) time. To do this, the fact that 2 suffixes of the same string are being sorted should be used.
Now suppose that an order relation between the suffixes has been obtained when they are sorted by their first 2kcharacters. That is, k steps of the algorithm have been done. Now to obtain the order relation in (k+1)th step, best possible use of order relations in previous steps must be done. Now in the (k+1)th step, suppose comparison of 2 suffixes at indices i and j needs to be done. Let us denote the rank of yth suffix after x steps by Pxy.
Observation: A string of length 2k+1 can be broken down into 2 strings of length 2k. If Pki<Pkj, then P(k+1)i<P(k+1)j and we know the relation. Else if Pki>Pkj, then again we know the relation between them. If Pki=Pkj, then we can obtain the relation between P(k+1)i and P(k+1)j by comparing Pk(i+2k) and Pk(j+2k), because the first 2k characters of the suffixes starting at indices i and j are same as Pki = Pkj. If Pk(i+2k) and Pk(j+2k) are also same, then we assign the same rank to both the suffixes.
Therefore at step (k+1), to compare 2 suffixes in O(1) time, a tuple of 2 integers can be stored for each suffix. Let us name the suffix suf and its index be i. First integer of tuple that will be stored for suf would be Pki, that is the rank of suf when it was sorted by first 2k characters. Second integer of tuple that will be stored would be Pk(i+2k), that is the rank of suffix starting at index (i+2k), when it was sorted by the first 2k characters. This tuple is enough to compare 2 suffixes in O(1) time as shown above.
It might be possible that (i+2k) exceeds the string length. In that case some negative number can be assigned to the second integer of tuple of suf, so that lexicographic order can be maintained. The importance of assigning a negative number to the second integer of tuple can be understood as follows: Let there be 2 suffixes that are ranked same according to their first 2k characters and let length of first suffix be greater or equal to 2k+1 and let length of second suffix be less than 2k+1. As the rank of these suffixes is same according to their first 2kcharacters, second suffix should surely come before the first suffix in lexicographical ordering because it is of lesser length. Therefore assigning a negative number to the second integer of tuple can help here.
Here is some pseudo code to construct suffix array.
SA = [] // Suffix Array

P = [][] // P[i][j] denotes rank of suffix at position 'j' when all suffixes are sorted by their first '2^i' characters

str = [] // initial string, 1 based indexing

POWER = [] //array of powers of 2, POWER[i] denotes 2^i

tuple {
    first, second, index;
}

L = [] // Array of Tuples

N = length of str

for i = 1 to N:
    P[0][i] = str[i] - 'a' // Give initial rank when suffixes are sorted by their first 2^0 = 1 character.

step = 1

for i = 1; POWER[i-1]<N; i++, step++:
    for j = 1 to N:
        L[j].index = j
        L[j].first = P[i-1][j]
        L[j].second = (j+POWER[i-1]<=n ? P[i-1][j+POWER[i-1]] : -1)

    sort(L)

    for j = 1 to N:
        P[i][L[j].index] = ((j>1 and L[j].first==L[j-1].first and L[j].second==L[j-1].second) ? P[i][L[j-1].index] : j) 
        /*Assign same rank to suffixes which have same number in the first and second fields of their respective tuples.*/

step = step - 1
Now at the stepth row of matrix P, we have the ranks of all suffixes. Now we can get the suffix array very easily in O(N).
for i = 1 to N:
    SA[P[step][i]] = i
Note: Care must be taken when string length is 1, in that case if the string is "c", then it will get a rank of ('c'-'a') that is 2 because we will not enter the for loop. In this case you can manually put the rank as 1, that is P[0][1]=1instead of P[0][1]=str[1]-'a'.
Often it is required to find the Longest Common Prefix (LCP) of 2 suffixes. This can be done easily in O(logN)time by using the array P. The following fact is used to find the LCP of 2 suffixes starting at indices i and j: If P[x][i]==P[x][j], then first 2x characters starting at indices i and j are same. Below is the pseudo code:
LCP(i,j): //returns the length of LCP of suffixes starting at indices i and j

    if i==j:

        return N-i+1

    return_value=0

    for x = step to 0:

        if P[x][i]==P[x][j]:

            return_value = return_value + POWER[x]

            i = i + POWER[x]

            j = j+ POWER[x]

    return return_value
Now coming to the original problem, to find the longest sub string that occurs at least M times.
First build the suffix array of string S. If in the sorted array of suffixes, the LCP of 2 suffixes is K, then the prefix of length K of all suffixes between these 2 suffixes is same. Let index of these 2 suffixes be i and j (i<j), then a sub string of length K repeats (ji+1) times.
To find the solution to motivation problem, one can iterate through all the suffixes in sorted order from 0 to (NM+1), and find the LCP of current suffix and suffix at index (M1) greater than it. This LCP will repeat at least M times, and the maximum of all these LCPs can be taken. Time complexity: O(N(logN)2).
Pseudo code:
build Suffix Array

for i = 1 to (N-M+1):
    ans=max(ans, LCP(SA[i],SA[i+M-1]))

Không có nhận xét nào:

Đăng nhận xét

Bài G - Educatioal Round 62

Đề bài: Bạn được cho 1 đồ thị vô hướng đặc biệt. Nó bao gồm $2n$ đỉnh được đánh số từ 1 đến 2n. Dưới đây là một số đặc tính của đồ thị: + ...