Dr. Mark Humphrys

School of Computing. Dublin City University.


Einstein - Search engine

  
Write an offline search engine to search offline web pages in your file system, and produce an offline output web page where you can click on links and view them offline.
  

Einstein test environment

As explained before, in the special Einstein test environment the script needs the extension .sh, and we need to run it with "./" in front of it.
Also, for technical reasons, we will call it gweb3.
So we need a file called gweb3.sh, and we run it like this:
 ./gweb3.sh string  


Debug with Shakespeare files

  1. You need some web pages to search.
  2. Download your own copy of the Shakespeare files.


Getting started

Go into your Shakespeare directory.
Make a file called gweb3.sh and put it in your Shakespeare directory. (As explained before, this is just for the test environment, so we can run with ./gweb3.sh. Normally we would put programs in $HOME/bin and set up the PATH.)

Start gweb3.sh looking like this:

grep -i "$1" */*html  

Run it like:

 ./gweb3.sh france 
This finds all lines in the Shakespeare files that contain the string "france", ignoring case.
It will output to the command line.
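
To see what this first version prints without the Shakespeare files, here is a small self-contained sketch. The directory and file names (demo, play1, play2, a.html, b.html) are made up for illustration. It runs the same grep over a throwaway directory.

```shell
# Build a tiny stand-in for the Shakespeare directory (hypothetical names).
demo=$(mktemp -d)
mkdir "$demo/play1" "$demo/play2"
printf '%s\n' 'The King of France is here.' 'Nothing on this line.' > "$demo/play1/a.html"
printf '%s\n' 'FRANCE and England.' > "$demo/play2/b.html"

# Same command as in gweb3.sh, with the search string given directly.
# The subshell keeps the cd from affecting the caller.
hits=$(cd "$demo" && grep -i "france" */*html)
printf '%s\n' "$hits"

# Tidy up the throwaway directory.
rm -rf "$demo"
```

Each hit comes out as the file name, then a colon, then the matching line. Both "France" and "FRANCE" match because of -i.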
  

Output to web page

Change it to output to the offline web page I asked for above.

exec > [output file]
grep -i "$1" */*html  

You can comment out the exec line if you want to send output to the command line again, and uncomment it to go back to the web page.
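
A minimal sketch of what "exec >" does. The file here is a temporary stand-in, not the output file asked for in the exercise. The subshell is only there so that this demo does not redirect your whole terminal session. In a script you would not need it.

```shell
# Hypothetical output file. The real location is the one given in the exercise.
out=$(mktemp)

(
  # From here on, everything this subshell writes to stdout goes to $out.
  exec > "$out"
  echo "first line"
  echo "second line"
)

# Back outside the subshell, stdout is the terminal again.
cat "$out"
```

Nothing appears on the command line until the final cat, which shows that both echo lines went into the file.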

  

Fix line breaks

Now if you double click on that page, you will notice it is a mess. Why? Because HTML does not respect line breaks.
To make HTML respect line breaks, we need to output "<pre>" before the grep and "</pre>" after it.

exec > [output file]
[output pre tag]
grep -i "$1" */*html  
[output end pre tag]

  

Fix HTML tags

It still looks a mess when you view it in the browser.
Why? Because it has HTML tags in it, and these get interpreted.
We want to just output the HTML tags, not interpret them.
You can see how to do this here.
So we get:

exec > [output file]
[output pre tag]
grep -i "$1" */*html  |  [sed command]
[output end pre tag]
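
One common way to stop a browser interpreting tags is to replace the characters "&", "<" and ">" with their HTML entities. The sketch below does this with sed. It may differ from the command in the linked notes, and the function name escape_html is made up for illustration.

```shell
# Hypothetical helper: replace the characters that HTML treats as markup.
# The ampersand is handled first, otherwise the "&" in "&lt;" would be escaped again.
escape_html()
{
 sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# A line shaped like grep output, with tags in the hit.
echo 'play1/a.html:<p>Tom & Jerry</p>' | escape_html
```

This prints play1/a.html:&lt;p&gt;Tom &amp;amp; Jerry&lt;/p&gt; if you view the raw text, and the browser then displays the original tags as plain text.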

  

Make the hits clickable

The output is looking better, but grep output is not clickable. It looks like this:
 file.html: hit 
To make it clickable, change the main line of the script to:

 
grep -i "$1" */*html  |  [sed command]  |  clickable

And then add the following Shell function at the top of the script:

 
clickable()
{
 while read line
 do
  file=` echo "$line" | [cut before the colon] `
   hit=` echo "$line" | [cut after the colon]  `
 
  echo "<a href='$file'>$file</a>: $hit <br>"
 done
}

See How to use cut to parse grep output.
Search my notes for what backquotes mean.
Follow the output format above or Einstein may get confused when marking.
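
As a hedged illustration of the parsing step, here is one way cut can split a line of grep output at the first colon. The sample line is made up, and the exact cut options are an assumption, so check them against the notes linked above.

```shell
# A made-up line in grep's "file:hit" format. Note the extra colon inside the hit.
line='play1/a.html:The King: of France'

# Backquotes run the command inside them and substitute its output.
# -d: sets the delimiter to a colon. -f1 is the first field.
file=`echo "$line" | cut -d: -f1`
# -f2- means field 2 to the end, so later colons in the hit are kept.
hit=`echo "$line" | cut -d: -f2-`

echo "file=$file"
echo "hit=$hit"
```

With -f2 instead of -f2- the hit would be cut off at its own first colon.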

  

Final fix of the links

We are nearly there. A stream of grep hits goes into clickable. It extracts the file and the hit, and constructs a link to the file. One link for each hit.
But (if you put the output file where I told you above) the links do not work. Why?
Fix the links by adding an adjustment:

   
  echo "<a href='[some prefix]$file'>$file</a>: $hit <br>"

Test it locally before upload. Make sure it constructs clickable links that work.
You now have your search engine!
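
Why a prefix can be needed: a browser resolves a relative link starting from the directory that holds the page, not from the directory where grep ran. The sketch below uses an entirely hypothetical layout (site, out and the prefix "../site/" are made up, and are not the locations in this exercise) to show the idea.

```shell
# Hypothetical layout, for illustration only:
#   $demo/site/play1/a.html   a page that grep found
#   $demo/out/                the directory holding the results page
demo=$(mktemp -d)
mkdir -p "$demo/site/play1" "$demo/out"
echo 'hello' > "$demo/site/play1/a.html"

file="play1/a.html"   # what grep reports when run from site/
prefix="../site/"     # path from the results page's directory to site/

# Resolve each form of the link from the results page's directory.
if [ -e "$demo/out/$file" ]; then bare=works; else bare=broken; fi
if [ -e "$demo/out/$prefix$file" ]; then prefixed=works; else prefixed=broken; fi

echo "href='$file' is $bare"
echo "href='$prefix$file' is $prefixed"

rm -rf "$demo"
```

The bare link is broken and the prefixed one works, because only the prefixed path leads from out/ to the file.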

  

Upload to Einstein

Your Shakespeare directory does not exist on Einstein. So when you upload your script to Einstein, you need this line before the grep:

 cd /shared/humphrys/shakespeare 

You still output to the same location I told you above.
And you need a different adjustment:

   
  echo "<a href='[some other prefix]$file'>$file</a>: $hit <br>"

  

Using it in normal life

In normal life (outside of the special lab test) you would use this script as follows:
  1. Rename the script to something simpler, like "gweb".
  2. Put it in $HOME/bin and make sure PATH is set up.
  3. Inside the script, "cd" to the Shakespeare directory before the search.
  4. Now you can be in any directory, and just type:
     gweb string 
    and it works.
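
The steps above can be tried out safely with throwaway directories. In this sketch everything is a stand-in: $demo/bin plays the role of $HOME/bin, $demo/pages plays the role of the Shakespeare directory, and the script is a cut-down gweb that only does the cd and the grep.

```shell
# Throwaway stand-ins (hypothetical names).
demo=$(mktemp -d)
mkdir -p "$demo/bin" "$demo/pages/play1" "$demo/pages/play2"
echo 'The King of France' > "$demo/pages/play1/a.html"
echo 'Nothing to see here' > "$demo/pages/play2/b.html"

# A cut-down gweb: cd to the pages first, then search.
# $demo is filled in now. \$1 is left for the script to expand when it runs.
cat > "$demo/bin/gweb" <<EOF
#!/bin/sh
cd "$demo/pages" || exit 1
grep -i "\$1" */*html
EOF
chmod +x "$demo/bin/gweb"

# Put the directory on the PATH, then call the script by name from anywhere.
PATH="$demo/bin:$PATH"
cd /
result=$(gweb france)
echo "$result"
```

Because the script does its own cd, it finds the pages no matter which directory you run it from.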