School of Computing. Dublin City University.
Online coding site: Ancient Brain
coders JavaScript worlds
but we know that currently:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]
So no point moving right by 1 position and trying to match pattern[0] with string[i-3]
Instead can jump straight to string[i] and start trying to match pattern[0] from there.
To be precise, the prefix "1" of the pattern
cannot be found in any of the bit we skip over "000".
Obviously a lot of choices for prefixes of the pattern we could be looking at.
"After a mismatch at string[i], restart comparing the pattern from string[i] onwards."
Why won't it work?
This will fail if beginning of pattern is in the bit we skip over.
Example 1:
pattern = 101001
string = ... 1010100111111
Let B = beginning of pattern that we lose.
B = "10"
Jumps over the match and misses it.
Obviously a lot of choices for prefixes of the pattern we could be looking at.
In above case, we should reset string index to i-2, but
is it possible to leave string[i] alone and just change
pattern[j]?
Example 2:
pattern = 10101001
string = ... 101010100111111
B = "1010"
Rule fails if
beginning of pattern B is in the bit we skip over
- This must mean that a beginning part of the pattern C
(not necessarily the same beginning part as B!)
must repeat twice at start of the pattern
(since we matched all of the bit we skipped over).
That is, pattern = C C ..
In both above C = "10"
In Example 1,
pattern starts with a repeat:
pattern[0] = pattern[2]
pattern[1] = pattern[3]
and:
string[i-4] == pattern[0]
string[i-3] == pattern[1]
string[i-2] == pattern[2]
string[i-1] == pattern[3]
that is:
string[i-2] == pattern[0]
string[i-1] == pattern[1]
So we can keep i as it is, and reset j from 4 to 2 and try matching from there
Given a search pattern, pre-build
a table, next[j], showing,
when there is a mismatch at pattern position j,
where to reset j to.
If match fails,
keep i same,
reset j to position next[j].
pattern = 10
string = ...
1100000
pattern = 110
string = ...
11100000
i stays same,
j goes from 2 back to 1
pattern = 100
string = ...
10100000
i stays same,
j goes from 2 back to 0
Example for pattern "ABABAC":
j | 0 | 1 | 2 | 3 | 4 | 5 |
substring 0 to j | A | AB | ABA | ABAB | ABABA | ABABAC |
longest prefix-suffix match | none | none | A | AB | ABA | none |
next[j] | 0 | 0 | 1 | 2 | 3 | 0 |
notes |
no prefix and suffix that are different i.e. next[0]=0 for all patterns |
Given j, let n = next[j]
"pattern[0] .. pattern[n-1]"
= "pattern[j-(n-1)] .. pattern[j]"
e.g. j = 4, n = 3,
"pattern[0] .. pattern[2]"
= "pattern[2] .. pattern[4]"
If match fails at position j+1,
keep i same,
reset pattern to position n.
Have already matched pattern[0] .. pattern[n-1]
e.g. We have matched ABABA so far.
If next one fails, say we have matched ABA so far and then see if next one matches.
That is, keep i same, just reset j to 3
(= precisely length of longest prefix-suffix match)
Then, if match after ABA fails too, by the same rule
we say we have matched A so far, reset to j = 1, and try again from there.
In other words, it starts by trying to match the longest prefix-suffix,
but if that fails it works down to the shorter ones until exhausted
(no prefix-suffix matches left).
i = 0 next[i] = 0 i++ j = 0 while ( i < m ) { if ( pattern[j] == pattern[i] ) { next[i] = j+1 i++ j++ } else ( pattern[j] != pattern[i] ) { if ( j > 0 ) j = next[j-1] else ( j == 0 ) { next[i] = 0 i++ j = 0 // redundant, just to make it clear what we are looping with } } } |
Outputs next[0], next[1], next[2] , .... in that order.
step | i | j | k = (i - j) | calculating | substring | notes | output | output j |
1 | 0 | 0 | 0 | next[0] | A | always = 0 | next[0] = 0 | 0 |
2 | 1 | 0 | 1 | next[1] | AB | pattern[0] != pattern[1] failed to match pattern[0] == pattern[1] |
next[1] = 0 | 0 |
3 | 2 | 0 | 2 | next[2] | ABA | pattern[0] == pattern[2] this is our first match, pattern[0] == pattern[2] pattern can be shifted right by k = i = 2 j = 0 means can't be shifted right by anything less than k = i = 2 let's see how long we can do this for i.e. how long we can start at pattern[0], with ever growing length, and shift right by k |
next[2] = 1 | 1 |
4 | 3 | 1 | 2 | next[3] | ABAB | pattern[1] == pattern[3]
we have already matched pattern[0] == pattern[2] so if we can match pattern[1] == pattern[3] then we get a match of length 2 note that the difference between j and i encodes a memory of: (a) when this process started (step 3) and (b) how much we are shifting right by (k) |
next[3] = 2 | 2 |
5 | 4 | 2 | 2 | next[4] | ABABA | pattern[2] == pattern[4]
we have already matched pattern[0] == pattern[2] and pattern[1] == pattern[3] so if we can match pattern[2] == pattern[4] then we get a match of length 3 |
next[4] = 3 | 3 |
6 | 5 | 3 | 2 | next[5] | ABABAC |
meaning of j:
output j = 3 from last round tells us that: pattern[0]..pattern[j-1] can be shifted right by k = (i-j) to match pattern[i-j]..pattern[i-1] pattern[0]..pattern[2] = ABA can be shifted right (by 2) to match pattern[2]..pattern[4] so only need to check if pattern[3] == pattern[5] pattern[3] != pattern[5] so now our shifting right of the ever-growing pattern since step 3 fails maybe there is another pattern we can shift right though but this will be a smaller pattern (i.e. we were correct to look for larger one first)
look for smaller pattern than ABA to shift right,
ending in pattern[4]
|
j = next[2] = 1 | |
7 | 5 | 1 | 4 |
output j = 1 from last round tells us that: pattern[0] can be shifted right by k = (i-j) to match pattern[i-1] pattern[0] = A can be shifted right (by 4) to match pattern[4] so check if pattern[1] == pattern[5] pattern[1] != pattern[5] |
j = next[0] = 0 | |||
8 | 5 | 0 | 5 |
no history back to start check if pattern[0] == pattern[5] pattern[0] != pattern[5] |
next[5] = 0 | 0 |