Einstein - Search engine
Write an offline search engine to search offline web pages in your file system, and produce an offline output web page where you can click on links and view them offline.
- Usage like:
gweb string
- It searches your files for the string, ignoring case.
- It sends its output into an offline output web page in the $HOME directory called:
gweb.html
(Pay attention to where the output has to be!)
- Double-click on that file to view it in a web browser.
The idea is that when you view that page, you can click on links to the files.
Einstein test environment
As explained before, in the special Einstein test environment, the script needs the extension .sh and must be called with "./" in front of it.
Also for technical reasons we will call it gweb3.
So the conclusion is that we need a file called gweb3.sh
and we run it like this:
./gweb3.sh string
Debug with Shakespeare files
- You need some web pages to search.
- Download your own copy of the Shakespeare files.
Getting started
Go into your Shakespeare directory.
Make a file called gweb3.sh and put it in your Shakespeare directory.
(As explained before, this is just for the test environment, so we can run it with ./gweb3.sh.
Normally we would put programs in $HOME/bin and set up the PATH.)
Start gweb3.sh looking like this:
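A minimal sketch of that starting point (assuming it is just the grep line used in the later steps, run from inside the Shakespeare directory) is:
#!/bin/sh
# Search every file matching */*html for the first argument, ignoring case.
grep -i "$1" */*html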
Run it like:
./gweb3.sh france
This finds all lines in Shakespeare with the string "france" in any case.
It will output to the command line.
Output to web page
Change it to output to the offline web page I asked for above.
exec > [output file]
grep -i "$1" */*html
You can comment/uncomment that line if you want to send output to the command line again.
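As a sketch, with the output file filled in as the required location ($HOME/gweb.html, specified above), the script body would be:
exec > $HOME/gweb.html
grep -i "$1" */*html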
Fix line breaks
Now if you double-click on that page, you will notice it is a mess.
Why? Because HTML does not respect line breaks.
To make HTML respect line breaks, we need to output
"<pre>" before the grep
and "</pre>" after it.
exec > [output file]
[output pre tag]
grep -i "$1" */*html
[output end pre tag]
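Filled in the same way, and using echo to output the tags (one obvious choice, not the only one), this step might look like:
exec > $HOME/gweb.html
echo "<pre>"
grep -i "$1" */*html
echo "</pre>"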
Fix HTML tags
It still looks a mess when you view it in the browser.
Why? Because it has HTML tags in it, and these get interpreted.
We want to just output the HTML tags, not interpret them.
You can see how to do this here.
So we get:
exec > [output file]
[output pre tag]
grep -i "$1" */*html | [sed command]
[output end pre tag]
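One possible sed command for this step (a sketch; the notes linked above may use a different form) replaces the characters HTML treats specially, escaping & first so the others are not double-escaped:
exec > $HOME/gweb.html
echo "<pre>"
grep -i "$1" */*html | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
echo "</pre>"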
Make the hits clickable
The output is looking better,
but grep output is not clickable. It looks like this:
file.html: hit
To make it clickable,
change the main line of the script to:
grep -i "$1" */*html | [sed command] | clickable
And then add the following shell function at the top of the script:
clickable()
{
    while read line
    do
        file=`echo "$line" | [cut before the colon]`
        hit=`echo "$line" | [cut after the colon]`
        echo "<a href='$file'>$file</a>: $hit <br>"
    done
}
See How to use cut to parse grep output.
Search my notes for what backquotes mean.
Follow the output format above or Einstein may get confused when marking.
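One way the cut commands could be filled in (a sketch; see the cut notes linked above) is to split each grep line on the first colon:
clickable()
{
    while read line
    do
        # everything before the first colon is the file name
        file=`echo "$line" | cut -d':' -f1`
        # everything after the first colon is the matching line
        hit=`echo "$line" | cut -d':' -f2-`
        echo "<a href='$file'>$file</a>: $hit <br>"
    done
}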
Final fix of the links
We are nearly there.
A stream of grep hits goes into clickable.
It extracts the file and hit, and constructs a link to the file.
One link for each hit.
But
(if you put the output file where I told you above)
the links do not work. Why?
Fix the links by adding an adjustment:
echo "<a href='[some prefix]$file'>$file</a>: $hit <br>"
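The links break because gweb.html sits in $HOME, while the href paths that grep produces are relative to the Shakespeare directory the script ran in, so the browser cannot resolve them from $HOME. One possible adjustment (assuming, purely for illustration, that your copy of the files lives in $HOME/shakespeare) is to prefix each link with the path to that directory:
# $HOME/shakespeare is a hypothetical location - use wherever your copy actually is
echo "<a href='$HOME/shakespeare/$file'>$file</a>: $hit <br>"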
Test it locally before uploading.
Make sure it constructs clickable links that work.
You now have your search engine!
Upload to Einstein
Your Shakespeare directory does not exist on Einstein.
So when you upload it to Einstein you need this line before the grep:
cd /shared/humphrys/shakespeare
You still output to the same location I told you.
And you need a different adjustment:
echo "<a href='[some other prefix]$file'>$file</a>: $hit <br>"
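A sketch of one plausible adjustment for Einstein (an assumption; the exact prefix Einstein expects may differ, so check the course notes) is to use the shared directory's full path:
# /shared/humphrys/shakespeare is where the script cd's to above; this prefix is a guess
echo "<a href='/shared/humphrys/shakespeare/$file'>$file</a>: $hit <br>"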
Using it in normal life
In normal life (outside of the special lab test)
you would use this script as follows:
- Change the script to a simpler name, like "gweb".
- Put it in $HOME/bin and make sure PATH is set up.
- Inside the script, "cd" to the Shakespeare directory before search.
- Now you can be in any directory, and just:
gweb string
and it works.
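Putting the pieces together, a sketch of what that everyday gweb script might look like (the Shakespeare directory location here is illustrative, and the sed and cut commands are the possible choices sketched above):
#!/bin/sh
# gweb - offline search engine for a local copy of the Shakespeare pages.
# Usage: gweb string

clickable()
{
    while read line
    do
        file=`echo "$line" | cut -d':' -f1`
        hit=`echo "$line" | cut -d':' -f2-`
        echo "<a href='$DIR/$file'>$file</a>: $hit <br>"
    done
}

# hypothetical location of your copy of the files - adjust to suit
DIR=$HOME/shakespeare

cd "$DIR"
exec > $HOME/gweb.html

echo "<pre>"
grep -i "$1" */*html | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g' | clickable
echo "</pre>"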