String searching
grep
grep
defines a
worldview on UNIX:
- Files in text format, not binary.
- Not forced to go through other people's applications.
- Can directly search the files yourself on the command line.
My search engine based on grep
find on DOS.
But how does grep (or other string-finding programs)
work?
Introduction
- String - sequence of 1 or more characters.
- Characters from a defined alphabet.
- Text strings
- e.g. Alphabet is:
- ASCII chars
- ISO 8859
(backward compatible with the printable ASCII chars)
- Unicode
(backward compatible with the printable ASCII chars)
- Binary strings.
- Alphabet is 0 and 1.
- Pattern matching problem
- Find occurrences of search pattern of size m
in text string of size n.
Where
m ≤ n
Naive (brute force) algorithm
string[0] ... string[n-1]
search for:
pattern[0] ... pattern[m-1]
i = 0 // start by trying to match from LHS of string
j = 0
repeat
{
if ( string[i] == pattern[j] )
{
i++
j++
}
else
{
i = i - j + 1 // (*)
j = 0 // reset pattern to start
}
}
until ( (j >= m) or (i >= n) )
if ( j == m ) return match // matches string[i-m] .. string[i-1]
else return no match
|
(*) say i = 50 and we have been successfully matching pattern for a bit
started at i = 50, j = 0
now i = 57, j = 7, when suddenly it fails
go to i = 57 - 7 + 1 = 51 and try with j = 0 from there
Performance of naive algorithm
- Large alphabet (e.g. Alphanumeric)
Imagine if pattern = "STING"
string = "A
STRING
SEARCH
CONSISTING
OF SOME TEXT"
S gets triggered 3 times before start of the match
ST gets triggered 1 time before the match
j++
executed 4 times before match
in typical text search, the if condition above is true only rarely
pointless j++'s (for aborted matches) are few
running time is O(m+n)
- Small alphabet (e.g. Binary)
However, if alphabet is small, this scales worse.
e.g. pattern = 000000001
string = 000000000000000000000000000000000000000000000000000000000000000100
worst case is O(mn)
What does grep use?
-
grep can't use the standard Boyer-Moore algorithm
because grep allows regular expressions
in the pattern.
regular expression
|
matches
|
grep "CO.*STING" |
"COSTING"
"CONSISTING"
"CONSISTING OF A LONG LINE WITH CONSISTING AND STING IN THE MIDDLE AND ENDING IN STING"
|
grep -i "<img.*src=" |
"<img .. (alt, size and any other parameters) .. src="
|
-
"How long is a search string defined by a regular expression?
Obviously the regular expression can represent strings of many sizes and herein lies the problem."
- grep normally uses a finite automaton
to implement the search.
Boyer-Moore is used, for example,
in programming language libraries
for methods without regular expressions, things like:
substring(pattern,string)
- return if pattern is found within string
Links