Einstein - Search engine
Write an offline search engine to search offline web pages in your file system (not on the Web).
The program produces an offline output web page where you can click on links and view the files offline.
- Usage would be like:
gweb string
- It searches your files for the string, ignoring case.
- It outputs to some offline web page.
- Double-click on that file to view in web browser.
When you view that page, you can click on links to see the files.
Lab exam
As explained before, in the lab exam the script needs the extension .sh and is called with "./" in front.
Also, for technical reasons, we will call it gweb4.
- So we need a file called gweb4.sh and run it like this:
./gweb4.sh string
- On Einstein, we will be testing it on files in a directory you do not own.
So we had better not send output to the current directory.
- For this test, send output to a file called gweb.html in your $HOME directory.
(Use "$HOME" as the directory name.)
Debug with Shakespeare files
- You need some web pages to search.
- Download your own copy of the Shakespeare files.
Getting started
Go into your Shakespeare directory.
Make a file called gweb4.sh and, for reasons explained before, put it in the same directory as the Shakespeare files.
Now gweb4.sh could be a program to search all HTML files in and below the current directory, to any number of levels.
But to keep things simple, we will hard-code it to search just */*html (files whose names end in "html", exactly one directory level down).
That is how it will be tested on Einstein.
Start with gweb4.sh looking like this:
grep -i "$1" */*html
Run it like:
./gweb4.sh france
This finds all lines in the Shakespeare files that contain the string "france", in any mix of upper and lower case.
It will output to the command line.
Make sure it works before continuing!
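To see what the grep line prints before pointing it at real files, here is a small sketch. It assumes a throwaway directory holding two made-up HTML files (the names plays/a.html and plays/b.html are hypothetical) and uses a fixed search string in place of "$1".

```shell
# Build a tiny throwaway site to search (hypothetical file names).
demo=$(mktemp -d)
mkdir "$demo/plays"
printf '%s\n' '<p>King of FRANCE</p>' '<p>Exit</p>' > "$demo/plays/a.html"
printf '%s\n' '<p>To be</p>' > "$demo/plays/b.html"

cd "$demo" || exit 1
# Same search line as the script, with a fixed string instead of "$1".
# -i makes the match case-insensitive, so "france" finds "FRANCE".
hits=$(grep -i "france" */*html)
echo "$hits"
```

Because more than one file matches */*html, grep prefixes each hit with the file name and a colon. The later steps rely on that format.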
Output to web page
Change it to output to the exact offline output web page I asked for above.
exec > [output file]
grep -i "$1" */*html
When debugging, comment or uncomment the exec line
to switch between output to file and output to command line.
Make sure it works before continuing!
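The effect of the exec line can be seen in isolation. In this sketch the output file is a temporary file made with mktemp (an assumption for the demo, not the file asked for in the exam). The parentheses run the commands in a subshell, so the redirection does not affect the rest of your shell.

```shell
out=$(mktemp)            # hypothetical output file, for this demo only
(
  exec > "$out"          # from here on, stdout of this subshell goes to the file
  echo "first line"      # not shown on the command line
  echo "second line"
)
cat "$out"               # the two lines ended up in the file
```

In a script, the same exec line near the top redirects everything that the rest of the script prints.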
Fix line breaks
Now if you double-click on that page, you will notice it is a mess.
Why? Because HTML does not respect line breaks.
To make HTML respect line breaks, we need to output
"<pre>" before the grep
and "</pre>" after it.
exec > [output file]
[output pre tag]
grep -i "$1" */*html
[output end pre tag]
Try it again:
./gweb4.sh france
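One way to picture the wrapping is a small helper that prints the opening tag, runs a command, then prints the closing tag. The function name wrap_pre is hypothetical, and this is only a sketch of the idea; in the script itself two plain echo lines around the grep do the same job.

```shell
# wrap_pre: hypothetical helper that runs a command between <pre> tags.
wrap_pre() {
  echo "<pre>"
  "$@"                   # run the command passed as arguments
  echo "</pre>"
}

# Stand-in for the grep output: two lines that must stay on separate lines.
page=$(wrap_pre printf '%s\n' 'line one' 'line two')
echo "$page"
```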
Fix HTML tags
It still looks a mess when you view it in the browser.
Why? Because it has HTML tags in it, and these get interpreted.
We want to just output the HTML tags, not interpret them.
See How to print HTML tags without interpreting them.
So we get:
exec > [output file]
[output pre tag]
grep -i "$1" */*html | [sed command]
[output end pre tag]
Try it again:
./gweb4.sh france
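The linked page is not reproduced here, so the following is only one possible approach, not necessarily the command the course intends. It assumes that replacing the characters &, < and > with their HTML entities is enough. The function name escape_html is hypothetical.

```shell
# escape_html: hypothetical filter that makes tags display as text.
# "&" is replaced first, so the "&" in "&lt;" and "&gt;" is not escaped twice.
escape_html() {
  sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

escaped=$(printf '%s\n' '<b>Tom & Jerry</b>' | escape_html)
echo "$escaped"          # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
```

In a sed replacement, a bare & means "the matched text", which is why each entity is written with \& here.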
Make the hits clickable
The output is looking better,
but grep output is not clickable. It looks like this:
file.html: hit
To make it clickable,
change the main line of the script to:
grep -i "$1" */*html | [sed command] | clickable
And then add the following shell function at the top of the script:
clickable()
{
while read line
do
file=` echo "$line" | [cut before the colon] `
hit=` echo "$line" | [cut after the colon] `
echo "<a href='$file'>$file</a>: $hit <br>"
done
}
See How to use cut to parse grep output.
Respect the backquotes: they run the command inside them and capture its output.
Try it again:
./gweb4.sh france
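Splitting one line of grep output can be tried on its own. This sketch assumes cut with ":" as the delimiter; the linked page may use different options. The sample hit is made up, and it contains a second colon to show why the hit takes field 2 onwards and not just field 2.

```shell
line='plays/a.html:KING: of France'            # hypothetical grep hit

file=`printf '%s\n' "$line" | cut -d: -f1`     # everything before the first colon
hit=`printf '%s\n' "$line" | cut -d: -f2-`     # everything after it, later colons included

echo "<a href='$file'>$file</a>: $hit <br>"
```

If real input may contain backslashes, "read -r line" in the loop stops read from interpreting them.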
Final fix of the links
We are nearly there.
We run it and it builds a web page of hits.
But the links do not work. Why?
It is all to do with where the output file is and where the files we searched are.
Fix the links by adding an adjustment:
echo "<a href='[some prefix]$file'>$file</a>: $hit <br>"
Test it locally before uploading.
Make sure it constructs clickable links that work.
You now have your search engine!
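The reason the links break is that a browser resolves a relative href against the location of the page that contains it, which here is the output file in $HOME, not the directory that was searched. A sketch of the adjustment, with a made-up prefix (the function name link_for and the path /home/me/shakespeare/ are hypothetical, and the right prefix depends on where your files are):

```shell
# link_for: hypothetical helper.
#   $1 = prefix that leads from the output page to the searched directory
#   $2 = file path exactly as grep printed it
link_for() {
  echo "<a href='$1$2'>$2</a>"
}

link=$(link_for "/home/me/shakespeare/" "plays/a.html")
echo "$link"
```

The link text stays short, and only the href carries the prefix.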
Upload to Einstein
It works locally, but we still have a problem when we upload it to Einstein for marks.
The problem is that your Shakespeare directory does not exist on Einstein.
So when you upload it to Einstein, put this at the top:
cd /shared/humphrys/shakespeare
But then to make the links work, you need a different adjustment:
echo "<a href='[some other prefix]$file'>$file</a>: $hit <br>"
Now it should work on Einstein and you get the marks.
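Since the script now depends on a cd succeeding, it can help to stop early when the directory is missing. That avoids searching whatever directory you happen to be in. This is an optional sketch, not part of the exam instructions. The function name enter_dir is hypothetical, and a temporary directory stands in for the real one.

```shell
# enter_dir: hypothetical helper that reports a failed cd and does not carry on.
enter_dir() {
  cd "$1" 2>/dev/null || { echo "cannot cd to $1" >&2; return 1; }
}

tmp=$(mktemp -d)                       # stands in for the shared directory
enter_dir "$tmp" && echo "now in $PWD"
enter_dir "$tmp/missing" || status="failed"
echo "$status"
```

In a script, the simpler form is the cd line followed by "|| exit 1".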
Using it outside lab exam
In normal life (outside the lab exam)
you would use this script as follows:
- Rename the script to something simpler, like "gweb".
- Put it in $HOME/bin and make sure PATH is set up.
- Inside the script, "cd" to the correct directory where the web pages are.
- Output can continue to go to a file in $HOME, or somewhere else. Just make sure the links work.
- Now you can be in any directory, and just:
gweb string
and you get an output web page of hits.
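The PATH step above can be sketched as follows. To keep the demo harmless, a temporary directory stands in for $HOME/bin, and the gweb script is a one-line placeholder that only echoes its argument. Both are assumptions for illustration.

```shell
bindir=$(mktemp -d)                    # stands in for $HOME/bin in this demo
printf '%s\n' '#!/bin/sh' 'echo "searching for $1"' > "$bindir/gweb"
chmod +x "$bindir/gweb"                # the script must be executable

PATH="$bindir:$PATH"                   # put the directory on the search path
export PATH

cd / && result=$(gweb france)          # works from any directory, no "./" needed
echo "$result"
```

To make the PATH change permanent, the same assignment normally goes in your shell startup file.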