Dr. Mark Humphrys

School of Computing. Dublin City University.



How to write a search engine in 9 lines of Shell

The search engine for my website started as a server-side CGI script in approximately the following 9 lines of Shell.


#!/bin/sh

echo "Content-type: text/html"
echo

echo '<html> <head> <title> Search results </title> </head> <body>'

argument=`echo "$QUERY_STRING" | sed "s|q=||"`

cd /users/homes/me/public_html

echo '<pre>'
grep -i "$argument" *html */*html | sed -e 's|<|\&lt;|g' | sed -e 's|>|\&gt;|g'
echo '</pre>'


Notes:

  1. This is an online program. It accepts input through HTTP GET from an HTML form.

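  2. It assumes the input arguments can be read from the environment variable $QUERY_STRING, which is where CGI puts the part of the URL after the "?".

  3. The "q=" in the sed command assumes that the input field is named "q" in the HTML form.

  4. Your web directories need to be readable by the user that the script runs as for the wildcards to work.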

  5. We pipe the result of grep into an ugly-looking sed command. This is needed because the lines returned by grep contain HTML tags. Left alone, the browser would interpret those tags and display a mess. We need to print the HTML tags without interpreting them. See the following.
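
To make notes 2 and 3 concrete, here is a minimal sketch of how the argument reaches the script. It sets $QUERY_STRING by hand to simulate what the web server would do for a request ending in "?q=hello". The function name get_q is hypothetical and is not part of the original script. It uses the same sed pattern as the script above.

```shell
#!/bin/sh
# Hypothetical helper: strip the "q=" prefix from a CGI query string.
get_q() {
    # $1 is the raw query string, e.g. "q=hello"
    printf '%s\n' "$1" | sed "s|q=||"
}

# Simulate the environment the web server sets up for  ...?q=hello
QUERY_STRING='q=hello'
argument=`get_q "$QUERY_STRING"`
echo "$argument"
```

Running this prints the bare search term, which is what the script hands to grep.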

  

How to print HTML tags without interpreting them

It is common on a web page to want to print HTML tags without interpreting them.
This is sometimes called "escaping" the HTML tags.
One way is to pipe the text into a sed command that:

  1. converts all   < characters to   &lt;
  2. converts all   > characters to   &gt;

The sed command therefore is:

sed -e 's|<|\&lt;|g'   |   sed -e 's|>|\&gt;|g'

The command is tricky to write because "&" has a special meaning in a sed replacement (it stands for the text that was matched) and must be escaped as "\&".
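
The command above leaves the "&" character itself untouched, so text such as "&amp;" in a matched line would still be interpreted by the browser. A possible variant, sketched here as an assumption rather than as what the original script does, escapes "&" first and then the two angle brackets. The function name escape_html is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: escape stdin so a browser shows it literally.
escape_html() {
    # "&" must be handled first. If it came last, it would re-escape
    # the "&" characters that the "<" and ">" rules have just produced.
    sed -e 's|&|\&amp;|g' -e 's|<|\&lt;|g' -e 's|>|\&gt;|g'
}

echo '<a href="x.html">Tom & Jerry</a>' | escape_html
```

The three expressions run in order inside a single sed process, which is equivalent to chaining three sed commands with pipes.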



Some enhancements to the search engine

  1. Change the output so the user can actually click on the pages returned.

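  2. Handle spaces in the argument (multiple search words). In a GET query string a space arrives encoded as "+" or "%20", so the argument must be decoded before it is passed to grep.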
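
Both enhancements can be sketched in a few lines. This is one possible approach, not the author's implementation. The function names decode_spaces and links_for are hypothetical, the decoder handles only "+" and "%20", and the links assume that the search directory is the root of the website.

```shell
#!/bin/sh
# Hypothetical helper: turn "+" and "%20" back into spaces.
# Other %XX escapes are left as they are in this sketch.
decode_spaces() {
    printf '%s\n' "$1" | sed -e 's|+| |g' -e 's|%20| |g'
}

# Hypothetical helper: print one clickable list item per matching file.
# grep -l prints only the names of the files that contain a match.
# In the sed replacement an unescaped "&" stands for the matched text,
# which here is the whole file name.
# Assumption: the current directory is served as the site root "/".
links_for() {
    grep -i -l -- "$1" *html */*html 2>/dev/null |
        sed -e 's|.*|<li><a href="/&">&</a></li>|'
}
```

In the script, the argument would be passed through decode_spaces after the "q=" is removed, and links_for "$argument" would replace the grep line, wrapped in a list instead of a pre block. The decoded words are searched for as one phrase. File names containing characters that are special in HTML are not escaped here.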




Some further enhancements

  1. If your pages are nested more than 2 levels deep, you can write the levels out explicitly as   */*/*html etc., use a recursive grep, or run a recursive find first to build the filespec:
    cd /users/homes/me/public_html
    
    filespec=`find . -type f -name "*html" | tr '\n' ' '`
    
    grep -i "$argument" $filespec
    
    Since every search uses the same file list, it is more efficient to build the list once, cache it in a file, and then read it back:
    read filespec < filelist.txt
    
    grep -i "$argument" $filespec
    

  2. The pages are not ranked in order of relevance, but only in the order in which grep finds them.
    This is not easy to solve.
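
The two points above can be sketched together. This is an illustration under stated assumptions, not a real solution to ranking. It lets find pass the file names directly to grep, so that names containing spaces survive, and it uses the number of matching lines per file as a crude stand-in for relevance. The function name ranked_search is hypothetical, and file names containing ":" would confuse the sort.

```shell
#!/bin/sh
# Hypothetical helper: search every *html file below the current
# directory, at any depth, and list files with the most matching
# lines first, as "name:count".
ranked_search() {
    # grep -c prints a count of matching lines for each file.
    # /dev/null ensures grep always sees more than one file and so
    # always prefixes the count with the file name.
    find . -type f -name '*html' -exec grep -i -c -- "$1" /dev/null {} + |
        grep -v ':0$' |
        sort -t: -k2,2nr
}
```

The grep -v step drops files with no matching lines, and sort orders the rest numerically on the count, highest first. Counting matching lines is only a rough proxy. It says nothing about where on the page the word occurs or how important the page is.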



My search engine started out like this

My search engine started out like the above (plus a C++ input pre-processor for Web input security).

It has since been re-written in PHP, but there is still a grep at the core.

Obviously a heavy-duty search engine would index the files in advance, rather than grep-ing them on the spot. But a grep is perfectly fine for a site of fewer than, say, 5,000 pages.
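
To give a rough idea of what pre-indexing means, here is a toy sketch. It is not how any real search engine, or the author's site, works. It builds a file of "word TAB file" lines once and then answers single-word queries from that file without touching the pages. The function names build_index and lookup are hypothetical. HTML tag names are indexed as ordinary words, and a word is any run of ASCII letters and digits.

```shell
#!/bin/sh
# Hypothetical indexer: for each *html file below the current directory,
# write one "word<TAB>file" line per distinct lower-cased word.
# $1 is the index file to create.
build_index() {
    find . -type f -name '*html' | while IFS= read -r f; do
        tr -cs 'A-Za-z0-9' '\n' < "$f" |
            tr 'A-Z' 'a-z' |
            sort -u |
            awk -v f="$f" 'NF { print $0 "\t" f }'
    done > "$1"
}

# Hypothetical query: print the files that contain the given word.
# $1 is the index file, $2 is a single lower-case word.
lookup() {
    awk -F '\t' -v w="$2" '$1 == w { print $2 }' "$1"
}
```

The index would be rebuilt whenever the pages change, for example with build_index /tmp/index.txt, and each search then becomes lookup /tmp/index.txt word. The path /tmp/index.txt is only an example.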


