Boyer-Moore string search algorithm
Example
pattern = "STING"
string = "A STRING SEARCHING EXAMPLE CONSISTING OF TEXT"
- try to match first m characters
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
This fails.
Slide pattern right to look for other matches.
Note that R does not occur in pattern.
So can slide it past R.
You may be starting to guess that this method is for large alphabets
(e.g. human text)
while KMP is good for small alphabets
(where one could rarely ever do this kind of sliding).
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
Fails again.
Rightmost character
S is in pattern precisely once,
so slide until two S's line up.
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No C in pattern. Slide past it.
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No space in pattern. Slide past it.
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No P in pattern. Slide past it.
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
No O in pattern. Slide past it.
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
Rightmost char T.
Exactly one T in pattern.
Slide to align them.
-
STING
A STRING SEARCHING EXAMPLE CONSISTING OF TEXT
match
Skip table
For each character in alphabet
found in the rightmost slot on a mismatch,
this table tells you how many slots to shift the pattern right.
For all characters not in the pattern,
table entry = m
For pattern STING:
If N is rightmost char on mismatch,
shift right by 1
to line the N's up.
For pattern CONSISTING:
C |
O |
N |
S |
I |
S |
T |
I |
N |
G |
9 |
8 |
|
|
|
4 |
3 |
2 |
1 |
10 |
Blank for a letter means the value for this letter is listed in the other copy.
There are two I's.
Shift right by 2 not by 5, to avoid missing possible match.
For pattern DISGUSTING:
D |
I |
S |
G |
U |
S |
T |
I |
N |
G |
9 |
|
|
6 |
5 |
4 |
3 |
2 |
1 |
|
There are two G's.
-
Boyer-Moore string search algorithm
- Complexity is O(n).
The execution time can actually be sub-linear: it doesn't need to actually check every character of the string to be searched but rather skips over some of them
(check right-most character of the block of m first,
if not found in pattern can skip entire rest of block).
- Best-case performance is O(n/m).
In the best case, only one in m characters needs to be checked.
- Actually works better (on average) with longer m!
- Boyer-Moore-Horspool algorithm
is the simplification described above.