Dr. Mark Humphrys

School of Computing. Dublin City University.

Online coding site: Ancient Brain

coders   JavaScript worlds

Search:

Free AI exercises


String searching



grep

grep defines a worldview on UNIX:
  1. Files in text format, not binary.
  2. Not forced to go through other people's applications.
  3. Can directly search the files yourself on the command line.
My search engine based on grep

find on DOS.

But how does grep (or other string-finding programs) work?



Introduction



Naive (brute force) algorithm

string[0] ... string[n-1]
search for:
pattern[0] ... pattern[m-1]


i = 0  	// start by trying to match from LHS of string
j = 0

repeat
{
 if ( string[i] == pattern[j] )
 {
  i++
  j++
 }
 else
 {
  i = i - j + 1   	// (*)
  j = 0    			// reset pattern to start
 }
}
until ( (j >= m) or (i >= n) )

if ( j == m ) return match  		// matches string[i-m] .. string[i-1]
 else return no match

(*) say i = 50 and we have been successfully matching pattern for a bit
started at i = 50, j = 0
now i = 57, j = 7, when suddenly it fails
go to i = 57 - 7 + 1 = 51 and try with j = 0 from there


Performance of naive algorithm

  1. Large alphabet (e.g. Alphanumeric)
    Imagine if pattern = "STING"
    string = "A STRING SEARCH CONSISTING OF SOME TEXT"
    S gets triggered 3 times before start of the match
    ST gets triggered 1 time before the match
    j++ executed 4 times before match
    in typical text search, the if condition above is true only rarely
    pointless j++'s (for aborted matches) are few
    running time is O(m+n)

  2. Small alphabet (e.g. Binary)
    However, if alphabet is small, this scales worse.
    e.g. pattern = 000000001
    string = 000000000000000000000000000000000000000000000000000000000000000100
    worst case is O(mn)



Knuth-Morris-Pratt string search algorithm




Boyer-Moore string search algorithm



What does grep use?


Boyer-Moore is used, for example, in programming language libraries for methods without regular expressions, things like:
substring(pattern,string) - return if pattern is found within string



Links



ancientbrain.com      w2mind.org      humphrysfamilytree.com

On the Internet since 1987.      New 250 G VPS server.

Note: Links on this site to user-generated content like Wikipedia are highlighted in red as possibly unreliable. My view is that such links are highly useful but flawed.