Dr. Mark Humphrys

School of Computing. Dublin City University.

Online coding site: Ancient Brain

Einstein - Search engine

  
Write an offline search engine that searches web pages stored in your file system (not on the Web).
The program builds an offline output web page of hits, where you can click on links and view the matching pages offline.
  

Lab exam

As explained before, in the lab exam the script needs the extension .sh and must be called with "./" in front.
Also, for technical reasons, we will call it gweb4.


Debug with Shakespeare files

  1. You need some web pages to search.
  2. Download your own copy of the Shakespeare files.


Getting started

Go into your Shakespeare directory.
Make a file called gweb4.sh and, for reasons explained before, put it in the same directory as the Shakespeare files.

Now gweb4.sh could be a program to search all HTML files in and below the current directory, to any number of levels.
But to keep things simple, we will hard-code it to just search */*html
That is how it will be tested on Einstein.

Start with gweb4.sh looking like this:

grep -i "$1" */*html  

Run it like:

 ./gweb4.sh france 
This finds all lines in the Shakespeare files containing the string "france", in any mixture of upper and lower case.
It outputs to the command line.
Make sure it works before continuing!
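
As a sanity check, here is a self-contained sketch of what that grep stage does. It uses a throwaway demo directory in /tmp rather than the real Shakespeare pages, so the path and file contents below are invented for illustration only:

```shell
# Demo of the starting point: case-insensitive grep over the */*html glob.
# Two files are created so that grep prefixes each hit with its filename
# (with only one input file, grep would omit the filename).
mkdir -p /tmp/grepdemo/plays
cd /tmp/grepdemo
echo "The King of FRANCE arrives" > plays/demo.html
echo "no match on this page"      > plays/other.html
grep -i "france" */*html
# prints: plays/demo.html:The King of FRANCE arrives
```

Note the output format, file:hit, which the later steps will parse.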
  

Output to web page

Change it to output to the exact offline output web page I asked for above.

exec > [output file]
grep -i "$1" */*html  

When debugging, comment/uncomment that line to switch between output to file and output to command line.
Make sure it works before continuing!
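
A minimal sketch of how the exec redirect behaves (the filename /tmp/gweb4out.html is my own assumption, not the required output name):

```shell
# After "exec > file", every later command in the script writes to that file.
# Comment the exec line out while debugging to see output on the terminal.
exec > /tmp/gweb4out.html
echo "this line goes into the file, not onto the terminal"
```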

  

Fix line breaks

Now if you double-click on that page, you will notice it is a mess. Why? Because HTML does not respect line breaks: the browser collapses the newlines in the grep output.
To make HTML respect line breaks, we need to output "<pre>" before the grep and "</pre>" after it.

exec > [output file]
[output pre tag]
grep -i "$1" */*html  
[output end pre tag]
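
Note the tags must be quoted when echoed: an unquoted < or > would be read by the shell as redirection, not output. A sketch of the wrapping, with the grep stage stubbed out:

```shell
# Quote the tags: echo <pre> without quotes is a shell redirection, not output.
echo "<pre>"
echo "stand-in line for the grep output"
echo "</pre>"
```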

Try it again:

 ./gweb4.sh france 
  

Fix HTML tags

It still looks a mess when you view it in the browser.
Why? Because it has HTML tags in it, and these get interpreted.
We want to just output the HTML tags, not interpret them.
See How to print HTML tags without interpreting them.
So we get:

exec > [output file]
[output pre tag]
grep -i "$1" */*html  |  [sed command]
[output end pre tag]
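
One common sed command for this (my assumption; the linked page may use a different one) replaces &, < and > with their HTML entities, doing & first so that the later substitutions are not themselves re-escaped:

```shell
# Escape &, < and > so the browser displays tags instead of interpreting them.
# & must be done first, or the &lt; / &gt; replacements would be double-escaped.
echo "<b>France</b> & friends" | sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g'
# prints: &lt;b&gt;France&lt;/b&gt; &amp; friends
```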

Try it again:

 ./gweb4.sh france 
  

Make the hits clickable

The output is looking better, but grep output is not clickable. It looks like this:
 file.html: hit 
To make it clickable, change the main line of the script to:

 
grep -i "$1" */*html  |  [sed command]  |  clickable

And then add the following Shell function at the top of the script:

 
clickable()
{
 while read line
 do
  file=` echo "$line" | [cut before the colon] `
   hit=` echo "$line" | [cut after the colon]  `
 
  echo "<a href='$file'>$file</a>: $hit <br>"
 done
}

See How to use cut to parse grep output.
Respect the backquotes.
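
The two cut commands could look like this (a sketch under my own assumptions; see the linked cut notes). Field 1, before the first colon, is the filename; -f2- keeps everything from the second field onwards, so hits that themselves contain colons survive intact:

```shell
# Parse one line of grep output of the form "file:hit".
# The sample line below is invented for illustration.
line="comedies/asyoulikeit.html:ROSALIND: to France?"
file=` echo "$line" | cut -d':' -f1 `
 hit=` echo "$line" | cut -d':' -f2- `
echo "<a href='$file'>$file</a>: $hit <br>"
# prints: <a href='comedies/asyoulikeit.html'>comedies/asyoulikeit.html</a>: ROSALIND: to France? <br>
```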

Try it again:

 ./gweb4.sh france 
  

Final fix of the links

We are nearly there.
We run it and it builds a web page of hits.
But the links do not work. Why? It is all to do with where the output file is and where the files we searched are.
Fix the links by adding an adjustment:

   
  echo "<a href='[some prefix]$file'>$file</a>: $hit <br>"

Test it locally before upload. Make sure it constructs clickable links that work.
You now have your search engine!

  

Upload to Einstein

It works locally, but we still have a problem when we upload it to Einstein for marks.
The problem is your Shakespeare directory does not exist on Einstein.
So when you upload it to Einstein, put this at the top:

 cd /shared/humphrys/shakespeare 

But then to make the links work, you need a different adjustment:

   
  echo "<a href='[some other prefix]$file'>$file</a>: $hit <br>"

Now it should work on Einstein and you get the marks.

  

Using it outside lab exam

In normal life (outside lab exam) you would use this script as follows:
  1. Change script to a simpler name, like "gweb".
  2. Put it in $HOME/bin and make sure PATH is set up.
  3. Inside the script, "cd" to the correct directory where the web pages are.
  4. Output can continue to go to a file in $HOME, or somewhere else. Just make sure the links work.
  5. Now you can be in any directory, and just type:
      gweb string 
     to get an output web page of hits.
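
Putting the steps together, here is a self-contained sketch of such a gweb. Every path in it, and the demo pages it creates, are my own invention so the sketch can run anywhere; in real use you would point WEBDIR at your web-pages directory, put OUT somewhere like $HOME, and search for "$1" rather than a fixed string:

```shell
#!/bin/sh
# Sketch of a personal "gweb", built from the steps above.
# Creates a tiny demo tree in /tmp so it runs standalone.
WEBDIR=/tmp/gwebdemo
OUT=/tmp/gweb.html

mkdir -p "$WEBDIR/plays"
echo "<b>Exiled to FRANCE</b>" > "$WEBDIR/plays/demo.html"
echo "no match on this page"   > "$WEBDIR/plays/other.html"

clickable()
{
 while read line
 do
  file=` echo "$line" | cut -d':' -f1 `
   hit=` echo "$line" | cut -d':' -f2- `
  # Prefix the link with the search directory so it works from anywhere:
  echo "<a href='$WEBDIR/$file'>$file</a>: $hit <br>"
 done
}

cd "$WEBDIR" || exit 1
exec > "$OUT"

echo "<pre>"
# In the real script the search string would be "$1":
grep -i "france" */*html | sed 's/&/\&amp;/g; s/</\&lt;/g; s/>/\&gt;/g' | clickable
echo "</pre>"
```

Opening /tmp/gweb.html in a browser should then show one clickable hit linking to the demo page.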