Dr. Mark Humphrys

School of Computing. Dublin City University.

Online coding site: Ancient Brain

coders   JavaScript worlds

Search:

Free AI exercises


Knuth-Morris-Pratt string search algorithm


How to speed up brute force string search

Suppose pattern = 1000011
string = ... 1010001000011 ...
Let's say in our brute force search from LHS of string, we currently have the green bits aligned and have just discovered the mismatch at the red bit = string[i]
We note that pattern[0] is different from pattern[1] .. pattern[4]

but we know that currently:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]

So no point moving right by 1 position and trying to match pattern[0] with string[i-3]
Instead can jump straight to string[i] and start trying to match pattern[0] from there.

To be precise, the prefix "1" of the pattern cannot be found in any of the bit we skip over "000".
Obviously a lot of choices for prefixes of the pattern we could be looking at.



A rule that won't work

A first attempt at a rule to speed up brute force search:

"After a mismatch at string[i], restart comparing the pattern from string[i] onwards."

Why won't it work?
This will fail if beginning of pattern is in the bit we skip over.
Example 1:

pattern = 101001
string = ... 1010100111111
Let B = beginning of pattern that we lose.
B = "10"
Jumps over the match and misses it.

Obviously a lot of choices for prefixes of the pattern we could be looking at.
In above case, we should reset string index to i-2, but is it possible to leave string[i] alone and just change pattern[j]?

Example 2:

pattern = 10101001
string = ... 101010100111111
B = "1010"

Rule fails if beginning of pattern B is in the bit we skip over - This must mean that a beginning part of the pattern C (not necessarily the same beginning part as B!) must repeat twice at start of the pattern (since we matched all of the bit we skipped over).
That is, pattern = C C ..
In both above C = "10"

In Example 1, pattern starts with a repeat:
pattern[0] = pattern[2]
pattern[1] = pattern[3]
and:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]
that is:
string[i-2] == pattern[0]
string[i-1] == pattern[1]
So we can keep i as it is, and reset j from 4 to 2 and try matching from there



No backing up

Leaving i the same means no backing up in the (huge) string to be searched (e.g. a file).
Only changing j position in the (small) pattern.
Don't need an ever-changing buffer of the last m characters of the string.
Just store an array of the pattern at startup, and proceed through the string always forward with no buffering.


Knuth-Morris-Pratt string search algorithm

Start at LHS of string, string[0], trying to match pattern, working right.
Trying to match string[i] == pattern[j].

Given a search pattern, pre-build a table, next[j], showing, when there is a mismatch at pattern position j, where to reset j to.
If match fails, keep i same, reset j to position next[j].



How to build the table

Everything else below is just how to build the table.

Construct a table showing where to reset j to

  1. If mismatch string[i] != pattern[0], just move string to i+1, j = 0
  2. If mismatch string[i] != pattern[1], we leave i the same, j = 0

    pattern = 10
    string = ... 1100000

  3. If mismatch string[i] != pattern[2], we leave i the same, and change j, but we need to consider repeats in pattern[0] .. pattern[1]

    pattern = 110
    string = ... 11100000
    i stays same, j goes from 2 back to 1

    pattern = 100
    string = ... 10100000
    i stays same, j goes from 2 back to 0

  4. If mismatch string[i] != pattern[j], we leave i the same, and change j, but we need to consider repeats in pattern[0] .. pattern[j-1]
Given a certain pattern, construct a table showing where to reset j to.



Construct a table of next[j]

For each j, figure out:
next[j] = length of longest prefix in "pattern[0] .. pattern[j]" that matches the suffix of "pattern[1] .. pattern[j]"
That is:
  1. prefix must include pattern[0]
  2. suffix must include pattern[j]
  3. prefix and suffix are different

Example for pattern "ABABAC":

j 0 1 2 3 4 5
substring 0 to j A AB ABA ABAB ABABA ABABAC
longest prefix-suffix match none none A AB ABA none
next[j] 0 0 1 2 3 0
notes no prefix and suffix that are different
i.e. next[0]=0 for all patterns
         


Given j, let n = next[j]
"pattern[0] .. pattern[n-1]" = "pattern[j-(n-1)] .. pattern[j]"

e.g. j = 4, n = 3,
"pattern[0] .. pattern[2]" = "pattern[2] .. pattern[4]"

If match fails at position j+1, keep i same, reset pattern to position n.
Have already matched pattern[0] .. pattern[n-1]

e.g. We have matched ABABA so far.
If next one fails, say we have matched ABA so far and then see if next one matches.
That is, keep i same, just reset j to 3 (= precisely length of longest prefix-suffix match)
Then, if match after ABA fails too, by the same rule we say we have matched A so far, reset to j = 1, and try again from there.
In other words, it starts by trying to match the longest prefix-suffix, but if that fails it works down to the shorter ones until exhausted (no prefix-suffix matches left).



Algorithm to construct table of next[j]

Do this once, when the pattern comes in.
pattern[0] ... pattern[m-1]
Here, i and j both index pattern.



i = 0

next[i] = 0
i++
j = 0


while ( i < m )
{
 if ( pattern[j] == pattern[i] )
 {
	  next[i] = j+1
	  i++
	  j++
 }
 else ( pattern[j] != pattern[i] )
 {
	 if ( j > 0 )
	  j = next[j-1]
	 
	 else ( j == 0 )
	 {
	  next[i] = 0
	  i++
	  j = 0  			// redundant, just to make it clear what we are looping with
	 }
 }
}


Outputs next[0], next[1], next[2] , .... in that order.



Step-through of table-building algorithm for pattern "ABABAC"

step i j k = (i - j) calculating substring notes output output j
1 0 0 0 next[0] A always = 0 next[0] = 0 0
2 1 0 1 next[1] AB pattern[0] != pattern[1]
failed to match pattern[0] == pattern[1]
next[1] = 0 0
3 2 0 2 next[2] ABA pattern[0] == pattern[2]
this is our first match, pattern[0] == pattern[2]
pattern can be shifted right by k = i = 2
j = 0 means can't be shifted right by anything less than k = i = 2
let's see how long we can do this for
i.e. how long we can start at pattern[0], with ever growing length, and shift right by k
next[2] = 1 1
4 3 1 2 next[3] ABAB pattern[1] == pattern[3]
we have already matched pattern[0] == pattern[2]
so if we can match pattern[1] == pattern[3]
then we get a match of length 2
note that the difference between j and i encodes a memory of:
(a) when this process started (step 3) and (b) how much we are shifting right by (k)
next[3] = 2 2
5 4 2 2 next[4] ABABA pattern[2] == pattern[4]
we have already matched pattern[0] == pattern[2] and pattern[1] == pattern[3]
so if we can match pattern[2] == pattern[4]
then we get a match of length 3
next[4] = 3 3
6 5 3 2 next[5] ABABAC meaning of j:
output j = 3 from last round tells us that:
pattern[0]..pattern[j-1] can be shifted right by k = (i-j) to match pattern[i-j]..pattern[i-1]
pattern[0]..pattern[2] = ABA can be shifted right (by 2) to match pattern[2]..pattern[4]
so only need to check if pattern[3] == pattern[5]
pattern[3] != pattern[5]
so now our shifting right of the ever-growing pattern since step 3 fails
maybe there is another pattern we can shift right though
but this will be a smaller pattern (i.e. we were correct to look for larger one first)

look for smaller pattern than ABA to shift right, ending in pattern[4]
we know ABA itself shifts right to end in pattern[4]
so the smaller pattern must be a prefix of ABA and a suffix of ABA
find the longest one (in j = 2)

  j = next[2] = 1
7 5 1 4     output j = 1 from last round tells us that:
pattern[0] can be shifted right by k = (i-j) to match pattern[i-1]
pattern[0] = A can be shifted right (by 4) to match pattern[4]
so check if pattern[1] == pattern[5]
pattern[1] != pattern[5]
  j = next[0] = 0
8 5 0 5     no history
back to start
check if pattern[0] == pattern[5]
pattern[0] != pattern[5]
next[5] = 0 0






ancientbrain.com      w2mind.org      humphrysfamilytree.com

On the Internet since 1987.      New 250 G VPS server.

Note: Links on this site to user-generated content like Wikipedia are highlighted in red as possibly unreliable. My view is that such links are highly useful but flawed.