Einstein - Search files
We are going to write an offline search engine to search offline web pages in your file system (not on the Web).
The output of the program is an offline output web page where you can click on links and view them offline.
- Usage would be like:
gweb string
- It knows where your web pages are. It searches them for the string, ignoring case.
- It outputs to an offline web page.
- Double-click on that file to view in web browser.
When you view that page, you can click on links to see the files.
Setup
Get files to search
- To run this locally, you need some files to search.
- Download your own copy of the Shakespeare files.
- Then we can practice running the script on them.
Usage in lab exam
In the special environment of the lab exam, we have some issues:
- For reasons already explained, we need extension .sh
- The program for this lab must be called gweb4.sh
- For reasons already explained, we probably will not change the PATH, but rather call the program by putting "./" in front.
- So in the lab exam we run it like:
./gweb4.sh string
- That assumes we are in the same dir as the program. So we make sure that is true for the lab exam.
Output file
- Output must be sent to a file called gweb.html in your $HOME directory. (Use "$HOME" as the directory name.)
Getting started
Go into your Shakespeare directory.
Make a file called gweb4.sh and, for reasons explained before, put it in the same directory as the Shakespeare files.
Now gweb4.sh could be a program to search all HTML files in and below the current directory, to any number of levels.
But to keep things simple, we will hard-code it to just search */*html
That is how it will be tested on Einstein.
Start with gweb4.sh looking like this:
grep -i "$1" */*html
Run it like:
./gweb4.sh france
This finds all lines in Shakespeare with the string "france" in any case.
It will output to the command line.
Make sure it works before continuing!
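If you want to see what the search command prints before pointing it at the Shakespeare files, here is a minimal sketch on a throwaway tree. The directory names, file names and file contents are invented for the demo; only the grep command itself comes from this lab.

```shell
# Build a tiny temporary tree with the one-level layout that */*html expects.
# All names below are made up for this demo.
demo=$(mktemp -d)
mkdir "$demo/henry_v" "$demo/hamlet"
printf '%s\n' '<p>The King of FRANCE</p>' '<p>Exit</p>' > "$demo/henry_v/act1.html"
printf '%s\n' '<p>To be or not to be</p>'               > "$demo/hamlet/act3.html"

# Same command as in the script, run from the top of the tree.
# -i ignores case, so "france" matches "FRANCE".
# Because more than one file is searched, grep prefixes each hit with "file:".
hits=$(cd "$demo" && grep -i "france" */*html)
echo "$hits"

rm -rf "$demo"
```

This prints the single line henry_v/act1.html:<p>The King of FRANCE</p>, which is the "file:hit" format the later steps rely on.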
Output to web page
Change it to output to the exact offline output web page I asked for above.
exec > [output file]
grep -i "$1" */*html
When debugging, comment/uncomment the exec line to switch between output to file and output to command line.
Make sure it works before continuing!
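To see what "exec >" does in isolation, here is a small sketch. It uses a temporary file as a hypothetical stand-in for the real output file, and runs the redirect inside a subshell so that your own terminal is not affected.

```shell
out=$(mktemp)             # hypothetical stand-in for the real output file

(
    exec > "$out"         # from here on, all stdout of this subshell goes to the file
    echo "first line"
    echo "second line"
)

# Outside the subshell, output still goes to the terminal.
echo "Nothing above was printed to the terminal. The file contains:"
saved=$(cat "$out")
echo "$saved"

rm -f "$out"
```

In the real script there is no subshell: the exec line sits near the top and every later command writes to the file. That is why commenting out that one line is enough to send everything back to the command line.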
Fix line breaks
Now if you double-click on that page, you will notice it is a mess.
Why? Because HTML does not respect line breaks.
To make HTML respect line breaks, I suggest the following quick fix.
Output "<pre>" before the grep and "</pre>" after it.
exec > [output file]
[output pre tag]
grep -i "$1" */*html
[output end pre tag]
Try it again:
./gweb4.sh france
Fix HTML tags
It still looks a mess when you view it in the browser.
Why? Because it has HTML tags in it, and these get interpreted.
We want to just output the HTML tags, not interpret them.
See How to print HTML tags without interpreting them.
So we get:
exec > [output file]
[output pre tag]
grep -i "$1" */*html | [sed command]
[output end pre tag]
Try it again:
./gweb4.sh france
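The idea behind printing tags without interpreting them can be tried on a single made-up line. The sketch below is one common way to do it with sed, by replacing the special characters with HTML entities. It is an assumption that this matches the linked notes, which may use a different command. The function name escape_html is invented for the demo.

```shell
# Hypothetical helper: make text safe to display literally in a web page.
# "&" is replaced first, otherwise the "&" in "&lt;" would itself be escaped again.
# In a sed replacement a bare "&" means "the matched text", so it is written "\&".
escape_html() {
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

sample=$(echo '<b>Fish & chips</b>' | escape_html)
echo "$sample"
```

This prints &lt;b&gt;Fish &amp; chips&lt;/b&gt;. A browser shows that as the literal text <b>Fish & chips</b> instead of bold text.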
Make the hits clickable
The output is looking better, but grep output is not clickable. It looks like this:
file.html: hit
To make it clickable, change the main line of the script to:
grep -i "$1" */*html | [sed command] | clickable
And then add the following shell function at the top of the script:
clickable()
{
  while read line
  do
    file=` echo "$line" | [cut before the colon] `
    hit=` echo "$line" | [cut after the colon] `
    echo "<a href='$file'>$file</a>: $hit <br>"
  done
}
See How to use cut to parse grep output.
Respect the backquotes.
Try it again:
./gweb4.sh france
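Splitting a line at the colon can be tried on its own before wiring it into the function. The sketch below uses cut with ":" as the delimiter on a made-up line in grep's "file:hit" format. The exact field choices are an assumption and the linked notes may do it differently.

```shell
# A made-up line in "file:hit" format. Note that the hit itself contains a colon.
line='henry_v/act1.html:KING: Now for France!'

# Field 1 is everything before the first colon.
file=` echo "$line" | cut -d: -f1 `

# "-f2-" means field 2 through to the end, so colons inside the hit are kept.
hit=` echo "$line" | cut -d: -f2- `

echo "file=$file"
echo "hit=$hit"
```

This prints file=henry_v/act1.html and hit=KING: Now for France!. Asking for field 2 alone would cut the hit short at its next colon.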
Final fix of the links
We are nearly there.
We run it and it builds a web page of hits.
But the links do not work. Why?
It is all to do with where the output file is and where the files we searched are.
Fix the links by adding an adjustment:
echo "<a href='[some prefix]$file'>$file</a>: $hit <br>"
Test it locally before upload.
Make sure it constructs clickable links that work.
You now have your search engine!
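If it is not obvious why a prefix is needed, the sketch below imitates how a browser resolves a relative link: it starts from the directory of the page being viewed, not from the directory that was searched. The layout, the directory name "plays" and the helper "resolve" are all invented for the demo. Your own prefix depends on where your Shakespeare directory sits relative to $HOME.

```shell
# Invented layout: the output page lives in $fakehome,
# and the searched files live one level below it, in "plays".
fakehome=$(mktemp -d)
mkdir -p "$fakehome/plays/henry_v"
touch "$fakehome/plays/henry_v/act1.html"

file="henry_v/act1.html"      # what grep reports, relative to the searched directory

# Hypothetical helper: a relative href is resolved against the page's own directory.
resolve() { echo "$fakehome/$1"; }

[ -e "$(resolve "$file")" ]       && bare=found  || bare=missing
[ -e "$(resolve "plays/$file")" ] && fixed=found || fixed=missing

echo "without prefix: $bare"
echo "with prefix:    $fixed"

rm -rf "$fakehome"
```

Without the prefix the link points at a file that does not exist. With the prefix it points at the real file.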
Upload to Einstein
It works locally, but we still have a problem when we upload it to Einstein for marks.
The problem is your Shakespeare directory does not exist on Einstein.
So when you upload it to Einstein, put this at the top:
cd /shared/humphrys/shakespeare
But then to make the links work, you need a different adjustment:
echo "<a href='[some other prefix]$file'>$file</a>: $hit <br>"
Now it should work on Einstein and you get the marks.
Using it outside lab exam
In normal life (outside lab exam) you would use this script as follows:
- Change script to a simpler name, like "gweb".
- Put it in $HOME/bin and make sure PATH is set up.
- Inside the script, "cd" to the correct directory where the web pages are.
- Output can continue to go to a file in $HOME, or somewhere else. Just make sure the links work.
- Now you can be in any directory, and just:
gweb string
and you get an output web page of hits.
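The steps above can be rehearsed safely with a throwaway directory standing in for $HOME. This is only a sketch: the stand-in script just echoes its argument, and in real use you would copy your finished script to $HOME/bin/gweb and set PATH in your shell startup file.

```shell
# Throwaway directory used in place of the real $HOME for this demo.
fake_home=$(mktemp -d)
mkdir -p "$fake_home/bin"

# Stand-in for the finished script, installed under the simpler name "gweb".
printf '%s\n' '#!/bin/sh' 'echo "searching for $1"' > "$fake_home/bin/gweb"
chmod +x "$fake_home/bin/gweb"

# Put the bin directory at the front of PATH for this shell session.
PATH="$fake_home/bin:$PATH"

# The command is now found by name from any directory, with no "./" in front.
result=$(cd / && gweb france)
echo "$result"
```

This prints searching for france, even though the current directory is not the one holding the script.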