String searching

grep

grep defines a worldview on UNIX:

Files in text format, not binary.
Not forced to go through other people's applications.
Can directly search the files yourself on the command line.

My search engine based on grep

find on DOS.

But how does grep (or other string-finding programs) work?

Introduction

String - sequence of 1 or more characters.
Characters from a defined alphabet.
Text strings - e.g. Alphabet is:
- ASCII chars
- ISO 8859 (backward compatible with the printable ASCII chars)
- Unicode (backward compatible with the printable ASCII chars)
Binary strings. - Alphabet is 0 and 1.
Pattern matching problem - Find occurrences of search pattern of size m in text string of size n. Where m ≤ n

Naive (brute force) algorithm

string[0] ... string[n-1]
search for:
pattern[0] ... pattern[m-1]

i = 0 // start by trying to match from LHS of string j = 0 repeat { if ( string[i] == pattern[j] ) { i++ j++ } else { i = i - j + 1 // (*) j = 0 // reset pattern to start } } until ( (j >= m) or (i >= n) ) if ( j == m ) return match // matches string[i-m] .. string[i-1] else return no match

(*) say i = 50 and we have been successfully matching pattern for a bit
started at i = 50, j = 0
now i = 57, j = 7, when suddenly it fails
go to i = 57 - 7 + 1 = 51 and try with j = 0 from there

Performance of naive algorithm

Large alphabet (e.g. Alphanumeric)
Imagine if pattern = "STING"
string = "A STRING SEARCH CONSISTING OF SOME TEXT"
S gets triggered 3 times before start of the match
ST gets triggered 1 time before the match
j++ executed 4 times before match
in typical text search, the if condition above is true only rarely
pointless j++'s (for aborted matches) are few
running time is O(m+n)
- Big O notation
Small alphabet (e.g. Binary)
However, if alphabet is small, this scales worse.
e.g. pattern = 000000001
string = 000000000000000000000000000000000000000000000000000000000000000100
worst case is O(mn)

Knuth-Morris-Pratt string search algorithm

Boyer-Moore string search algorithm

What does grep use?

grep can't use the standard Boyer-Moore algorithm because grep allows regular expressions in the pattern.

regular expression matches

grep "CO.*STING" "COSTING"
"CONSISTING"
"CONSISTING OF A LONG LINE WITH CONSISTING AND STING IN THE MIDDLE AND ENDING IN STING"

grep -i "<img.*src=" "<img .. (alt, size and any other parameters) .. src="

"How long is a search string defined by a regular expression? Obviously the regular expression can represent strings of many sizes and herein lies the problem."
grep normally uses a finite automaton to implement the search.

Boyer-Moore is used, for example, in programming language libraries for methods without regular expressions, things like:
substring(pattern,string) - return if pattern is found within string

regular expression	matches
grep "CO.*STING"	"COSTING" "CONSISTING" "CONSISTING OF A LONG LINE WITH CONSISTING AND STING IN THE MIDDLE AND ENDING IN STING"
grep -i "<img.*src="	"<img .. (alt, size and any other parameters) .. src="