No pain, No gain !!!: Suffix Arrays

Motivation Problem: Given a string $S$, find the longest substring that occurs at least $M$ times.

Brute Force method: For every substring $X$ of $S$, one can find all the occurrences of $X$ in $S$ using KMP. KMP takes $O(N)$ time per substring and there are $O(N^2)$ substrings, so the total time for this brute force method will be $O(N^3)$.
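As a rough illustration of the brute force idea, here is a minimal Python sketch. It is an assumption-laden stand-in rather than the method above: it uses Python's built-in `str.find` instead of KMP, and the function names are made up for this example. It is only meant for checking small inputs.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    pos = text.find(pattern)
    while pos != -1:
        count += 1
        # Restart one position later so overlapping matches are counted.
        pos = text.find(pattern, pos + 1)
    return count


def longest_repeated_brute(s, m):
    """Return a longest substring of s occurring at least m times ('' if none)."""
    n = len(s)
    # Try lengths from longest to shortest so the first hit is a longest one.
    for length in range(n, 0, -1):
        for start in range(n - length + 1):
            candidate = s[start:start + length]
            if count_occurrences(s, candidate) >= m:
                return candidate
    return ""
```

For example, `longest_repeated_brute("banana", 2)` finds "ana", whose two occurrences overlap.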

A faster solution using hashing: We can binary search on the length of the substring. This works because if some substring of length $X$ occurs at least $M$ times, then so does its prefix of length $X - 1$. For a current length $X$ in the binary search, the hash of every substring of length $X$ can be found in $O(N)$ total time with a rolling hash. While doing this, the hashes can be stored in a dictionary, and when all substrings of length $X$ are processed, the hash with maximum frequency is checked to see whether its frequency is greater than or equal to $M$. This takes $O(N (\log N)^2)$ time: one log factor comes from the binary search and the other from maintaining the dictionary (map in C++).
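The hashing approach could be sketched in Python roughly as follows. The modulus `MOD`, the base `BASE` and the function names are assumptions chosen for this example, not values from the text. Since `Counter` is hash-based, the extra log factor of an ordered map does not appear here. Like any hashing solution, it can in principle be fooled by hash collisions.

```python
from collections import Counter

MOD = (1 << 61) - 1  # assumed modulus (a large prime)
BASE = 131           # assumed base for the polynomial hash


def has_repeat(s, length, m):
    """True if some length-`length` substring hash occurs at least m times."""
    n = len(s)
    if length == 0:
        return True
    top = pow(BASE, length - 1, MOD)  # weight of the character leaving the window
    h = 0
    for ch in s[:length]:
        h = (h * BASE + ord(ch)) % MOD
    counts = Counter([h])
    for i in range(length, n):
        # Slide the window: drop s[i - length], append s[i].
        h = ((h - ord(s[i - length]) * top) * BASE + ord(s[i])) % MOD
        counts[h] += 1
    return max(counts.values()) >= m


def longest_repeated_length(s, m):
    """Binary search the largest length that has a substring repeated >= m times."""
    lo, hi = 0, len(s)
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_repeat(s, mid, m):
            lo = mid
        else:
            hi = mid - 1
    return lo
```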

A solution using Suffix Array:

A Suffix Array is a sorted array of suffixes of a string. Only the starting indices of the suffixes are stored in the array instead of the whole strings. For example, the Suffix Array of "banana" would look like this:

5 → a

3 → ana

1 → anana

0 → banana

4 → na

2 → nana

One naive way to make the suffix array would be to store all suffixes in an array and sort them. If we use an $O(N \log N)$ comparison based sorting algorithm, then the total time to make the suffix array would be $O(N^2 \log N)$, because string comparison takes $O(N)$ time. This is too slow for large strings.
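The naive construction fits in a couple of lines of Python. This is only a sketch for small inputs, with 0-based indices and a function name invented for this example. It is handy as a reference to test faster implementations against.

```python
def suffix_array_naive(s):
    """Return the 0-based start indices of the suffixes of s in sorted order."""
    # Each comparison looks at whole suffixes, hence the O(N^2 log N) worst case.
    return sorted(range(len(s)), key=lambda i: s[i:])
```

For "banana" this gives the order 5, 3, 1, 0, 4, 2.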

Below is shown an $O(N (\log N)^2)$ algorithm that constructs the suffix array. There is an $O(N \log N)$ algorithm and even an $O(N)$ algorithm to construct the suffix array, but in a programming contest environment, it is much easier to implement an $O(N (\log N)^2)$ algorithm. Also, the difference between an $O(N (\log N)^2)$ and an $O(N \log N)$ algorithm is scarcely noticeable for strings up to length $10^5$.

The algorithm is based on keeping the ranks of suffixes when the suffixes are sorted by their first $2^k$ characters in the $k^{th}$ step. Therefore we will execute $O(\log N)$ steps to completely build the suffix array.

Clearly, the comparison of $2$ suffixes must be optimised so that it runs in better than $O(N)$ time. In fact, $2$ suffixes can be compared in $O(1)$ time. To do this, we use the fact that the strings being sorted are all suffixes of the same string.

Now suppose that an order relation between the suffixes has been obtained when they are sorted by their first $2^k$ characters. That is, $k$ steps of the algorithm have been done. To obtain the order relation in the $(k+1)^{th}$ step, the order relations from the previous steps should be reused as much as possible. Suppose that in the $(k+1)^{th}$ step, $2$ suffixes at indices $i$ and $j$ need to be compared. Let us denote the rank of the $y^{th}$ suffix after $x$ steps by $P_{x,y}$.

Observation: A string of length $2^{k+1}$ can be broken down into $2$ strings of length $2^k$. If $P_{k,i} < P_{k,j}$, then $P_{k+1,i} < P_{k+1,j}$ and we know the relation. Else if $P_{k,i} > P_{k,j}$, then again we know the relation between them: $P_{k+1,i} > P_{k+1,j}$. If $P_{k,i} = P_{k,j}$, then the first $2^k$ characters of the suffixes starting at indices $i$ and $j$ are the same, so we can obtain the relation between $P_{k+1,i}$ and $P_{k+1,j}$ by comparing $P_{k,i+2^k}$ and $P_{k,j+2^k}$. If $P_{k,i+2^k}$ and $P_{k,j+2^k}$ are also the same, then we assign the same rank to both suffixes.

Therefore at step $(k+1)$, to compare $2$ suffixes in $O(1)$ time, a tuple of $2$ integers can be stored for each suffix. Let us name the suffix $suf$ and let its index be $i$. The first integer of the tuple stored for $suf$ would be $P_{k,i}$, that is, the rank of $suf$ when sorted by the first $2^k$ characters. The second integer of the tuple would be $P_{k,i+2^k}$, that is, the rank of the suffix starting at index $(i + 2^k)$ when sorted by the first $2^k$ characters. This tuple is enough to compare $2$ suffixes in $O(1)$ time as shown above.

It is possible that $(i + 2^k)$ exceeds the string length. In that case a negative number can be assigned to the second integer of the tuple of $suf$, so that lexicographic order is maintained. The importance of assigning a negative number can be understood as follows: Let there be $2$ suffixes that are ranked the same according to their first $2^k$ characters, and let the length of the first suffix be greater than $2^k$ and the length of the second suffix be exactly $2^k$, so that index $(i + 2^k)$ falls outside the string for the second suffix. The second suffix is then a proper prefix of the first, so it must come before the first suffix in lexicographical ordering. Since every real rank is non-negative, assigning a negative number to the second integer of its tuple produces exactly this order.
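To make the tuples concrete, here is a small Python sketch of a single doubling step on "banana", going from ranks by the first character to an ordering by the first two characters. It uses 0-based indices, and the helper name `doubling_keys` is made up for this illustration.

```python
def doubling_keys(rank, half):
    """Pair each suffix's rank with the rank of the suffix `half` positions later.

    rank[i] is the rank of suffix i by its first `half` characters.
    Suffixes whose second half starts past the end get -1.
    """
    n = len(rank)
    return [(rank[i], rank[i + half] if i + half < n else -1) for i in range(n)]


s = "banana"
rank0 = [ord(c) - ord("a") for c in s]   # ranks by the first character only
keys = doubling_keys(rank0, 1)           # keys for ordering by the first 2 characters
# Python compares tuples lexicographically, so this sorts by (first, second).
order = sorted(range(len(s)), key=lambda i: keys[i])
```

Here suffix 5 ("a") gets the key `(0, -1)` and therefore sorts before suffixes 1 and 3, which both start with "an" and share the key `(0, 13)`.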

Here is some pseudocode to construct the suffix array.

SA = [] // Suffix Array

P = [][] // P[i][j] denotes rank of suffix at position 'j' when all suffixes are sorted by their first '2^i' characters

str = [] // initial string, 1 based indexing

POWER = [] //array of powers of 2, POWER[i] denotes 2^i

tuple {
    first, second, index;
}

L = [] // Array of Tuples

N = length of str

for i = 1 to N:
    P[0][i] = str[i] - 'a' // Give initial rank when suffixes are sorted by their first 2^0 = 1 character.

step = 1

for i = 1; POWER[i-1]<N; i++, step++:
    for j = 1 to N:
        L[j].index = j
        L[j].first = P[i-1][j]
        L[j].second = (j+POWER[i-1]<=N ? P[i-1][j+POWER[i-1]] : -1)

    sort(L) // sort by first, then by second

    for j = 1 to N:
        P[i][L[j].index] = ((j>1 and L[j].first==L[j-1].first and L[j].second==L[j-1].second) ? P[i][L[j-1].index] : j) 
        /*Assign same rank to suffixes which have same number in the first and second fields of their respective tuples.*/

step = step - 1

Now at the $step^{th}$ row of matrix $P$, we have the ranks of all suffixes. Now we can get the suffix array very easily in $O(N)$ time:

for i = 1 to N:
    SA[P[step][i]] = i

Note: Care must be taken when the string length is $1$. In that case, if the string is "c", it will get a rank of ('c'-'a'), that is $2$, because we will not enter the for loop. In this case you can manually set the rank to $1$, that is P[0][1]=1 instead of P[0][1]=str[1]-'a'.
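The construction might be written in Python along the following lines. This is a sketch under a few assumptions that differ from the pseudocode: indices and ranks are 0-based, the initial ranks are compressed to `0..distinct-1` (which also takes care of the length $1$ case), and the function name is invented for this example.

```python
def build_suffix_array(s):
    """Return (sa, P): sa is the suffix array (0-based start indices) and
    P[k][i] is the rank of suffix i by its first 2**k characters."""
    n = len(s)
    if n == 0:
        return [], [[]]
    # Step 0: rank by the first character, compressed to 0..(distinct - 1).
    alphabet = {c: r for r, c in enumerate(sorted(set(s)))}
    P = [[alphabet[c] for c in s]]
    half = 1
    while half < n:
        prev = P[-1]
        # Tuple per suffix: (rank of first half, rank of second half or -1).
        keys = [(prev[i], prev[i + half] if i + half < n else -1)
                for i in range(n)]
        order = sorted(range(n), key=lambda i: keys[i])
        cur = [0] * n
        for pos in range(1, n):
            a, b = order[pos - 1], order[pos]
            # Equal tuples share a rank; otherwise use the sorted position.
            cur[b] = cur[a] if keys[a] == keys[b] else pos
        P.append(cur)
        half *= 2
    # The last row holds distinct ranks 0..n-1, so it can be inverted directly.
    sa = [0] * n
    for i, r in enumerate(P[-1]):
        sa[r] = i
    return sa, P
```

Each of the $O(\log N)$ steps sorts $N$ tuples, which matches the $O(N (\log N)^2)$ bound.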

Often it is required to find the Longest Common Prefix (LCP) of $2$ suffixes. This can be done easily in $O(\log N)$ time by using the array $P$. The following fact is used to find the LCP of $2$ suffixes starting at indices $i$ and $j$: If P[x][i]==P[x][j], then the first $2^x$ characters starting at indices $i$ and $j$ are the same. Below is the pseudocode:

LCP(i,j): //returns the length of LCP of suffixes starting at indices i and j
    if i==j:
        return N-i+1
    return_value=0
    for x = step down to 0:
        if i<=N and j<=N and P[x][i]==P[x][j]: //stop once either suffix is exhausted
            return_value = return_value + POWER[x]
            i = i + POWER[x]
            j = j + POWER[x]
    return return_value
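The LCP query could look like this in Python. To keep the sketch self-contained, the rank table is built here by a deliberately naive helper, `rank_table_naive`, which is an assumption of this example and not the fast construction. Indices are 0-based. Note the bounds check: after a jump, `i` or `j` may point just past the end of the string, and reading the table there would be out of range.

```python
def rank_table_naive(s):
    """P[x][i] = rank of s[i:i + 2**x] among all such prefixes (built naively)."""
    n = len(s)
    P, length = [], 1
    while True:
        prefixes = [s[i:i + length] for i in range(n)]
        rank = {p: r for r, p in enumerate(sorted(set(prefixes)))}
        P.append([rank[p] for p in prefixes])
        if length >= n:
            return P
        length *= 2


def lcp(P, n, i, j):
    """Length of the longest common prefix of suffixes i and j (0-based)."""
    if i == j:
        return n - i
    result = 0
    # Greedily take the largest power of two that still matches.
    for x in range(len(P) - 1, -1, -1):
        # Skip the comparison once either suffix is exhausted.
        if i < n and j < n and P[x][i] == P[x][j]:
            result += 1 << x
            i += 1 << x
            j += 1 << x
    return result
```

For "banana", the suffixes at 1 ("anana") and 3 ("ana") share a prefix of length 3: the query first matches 2 characters, then 1 more.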

Now coming to the original problem: to find the longest substring that occurs at least $M$ times.

First build the suffix array of string $S$. If, in the sorted array of suffixes, the LCP of $2$ suffixes is $K$, then the prefix of length $K$ of all suffixes between these $2$ suffixes is the same. Let the positions of these $2$ suffixes in the sorted array be $i$ and $j$ $(i < j)$; then a substring of length $K$ occurs at least $(j - i + 1)$ times.

To find the solution to the motivation problem, one can iterate through the suffixes in sorted order from position $1$ to $(N - M + 1)$, and find the LCP of the current suffix and the suffix $(M - 1)$ positions after it. This LCP occurs at least $M$ times, and the maximum of all these LCPs is the answer. Time complexity: $O(N (\log N)^2)$

Pseudo code:

build Suffix Array

ans = 0

for i = 1 to (N-M+1):
    ans=max(ans, LCP(SA[i],SA[i+M-1]))
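The sliding window over the sorted suffixes can be sketched in Python as follows. To keep the example short and self-contained, both the suffix array and the LCP are computed by naive stand-ins (plain sorting and character-by-character comparison), so this sketch does not reach the $O(N (\log N)^2)$ bound. Only the final loop mirrors the pseudocode. Indices are 0-based, $M \ge 1$ is assumed, and the function name is made up.

```python
def longest_repeated_length_sa(s, m):
    """Length of the longest substring of s that occurs at least m times."""
    n = len(s)
    # Naive stand-in for the suffix array construction.
    sa = sorted(range(n), key=lambda i: s[i:])

    def lcp(i, j):
        # Naive stand-in for the O(log N) LCP query.
        if i == j:
            return n - i
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    best = 0
    # Compare each suffix with the one (m - 1) positions after it in sorted order.
    for r in range(n - m + 1):
        best = max(best, lcp(sa[r], sa[r + m - 1]))
    return best
```

For "banana" with $M = 2$ the best window is the pair "ana", "anana", giving 3.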

Friday, January 4, 2019
