Find overloaded pages

Find web pages (among local files on disk, not remote web pages) that are overloaded with too many or too-large embedded images. These are the slowly-loading pages.
4 scripts

We are going to write 4 scripts. When combined as follows, they will add up the total size of all embedded images (JPEGs only) in a given HTML file:

tot1 file.html | tot2 | tot3 | tot4

We will be testing the scripts independently.
Notes

Embedded images look something like:

<img src="filename">
<img .... src="filename">
<img .... src="filename" .... >

In real life we might also have to consider single quotes, or no quotes at all. But in the pages we will test this on, we can assume there are double quotes.

The pages we are testing this on are in:

/home/exam/testsuite
Setup

To set up, we need to fix the PATH on the lab exam login, so that we can call programs by just typing the program name. I can help you with this bit. Type this:

mkdir /home/exam/bin
PATH=$PATH:/home/exam/bin
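The PATH line above only lasts for the current login session. If you want it to persist across logins (not required here), one common approach is to append the same line to your shell startup file; this assumes bash is the login shell on the lab machines:

# Assumption: bash login shell that reads /home/exam/.bashrc.
echo 'PATH=$PATH:/home/exam/bin' >> /home/exam/.bashrc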
For 30% - tot1
- grep the file for lines with an embedded image.
- Put newlines before and after every HTML tag.
- grep again for embedded images.
- Use grep to get rid of lines with 'http'.
- Use grep to search for JPEGs only.
- For simplicity, assume all JPEGs have the extension .JPG or .jpg.
  (One possible sketch of tot1 appears after the example output below.)

You should now just have a list of embedded local images, JPEGs only, like this:
(Go into testsuite.)
$ cd Cashel
$ tot1 george.html
<img src="Bitmaps/ric.crop.2.jpg">
<img width="98%" src="../Kickham/08.Mullinahone/SA400010.small.JPG">
<img width="98%" src="../Kickham/08.Mullinahone/SA400028.small.JPG">
<img width="95%" src="07.Carlow.Stn/SA400069.lores.jpg">
<img border=1 width="95%" src="07.Carlow.Stn/SA400070.lores.adjust.jpg">
Note it finds JPEGs of any case.
It does not return non-JPEG images.
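For reference, here is one possible sketch of tot1 (not the only valid solution). It assumes GNU grep and sed, as on the lab machines; in particular, GNU sed treats \n in the replacement as a newline.

# tot1 - one possible sketch, assuming GNU grep/sed.
# Usage: tot1 file.html

grep -i '<img' "$1" |             # lines containing an embedded image
  sed 's/</\n</g; s/>/>\n/g' |    # newline before and after every HTML tag
  grep -i '<img' |                # keep just the <img ...> tags
  grep -v 'http' |                # drop remote images
  grep -i '\.jpg"'                # JPEGs only (.jpg or .JPG, in double quotes)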
For 50% - tot2
You will pipe the above output into a second script, "tot2", which will extract the image file names as follows.

- Use sed to delete everything from start-of-line to src="
- sed is case-sensitive by default, so also delete everything up to SRC=" in case some are uppercase.
- Use sed to delete everything from " to end-of-line.
  (One possible sketch of tot2 appears after the example output below.)
You should now have a better list of local images, like this:
$ tot1 george.html | tot2
Bitmaps/ric.crop.2.jpg
../Kickham/08.Mullinahone/SA400010.small.JPG
../Kickham/08.Mullinahone/SA400028.small.JPG
07.Carlow.Stn/SA400069.lores.jpg
07.Carlow.Stn/SA400070.lores.adjust.jpg
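For reference, one possible sketch of tot2 (again, not the only solution); it reads tot1's output on standard input:

# tot2 - one possible sketch.
# 1. delete everything from start-of-line up to src="
# 2. same again, in case some files use uppercase SRC="
# 3. delete everything from the first remaining " to end-of-line

sed -e 's/^.*src="//' -e 's/^.*SRC="//' -e 's/".*$//'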
For 70% - tot3
You will pipe the above into a further script, "tot3", which will look up the file sizes (the adding-up happens in tot4).

- One issue is that some of the testsuite pages have broken links. For each file, we need to test that it exists before going to find its file size.
- So we start off with tot3 looking like this:
while read file
do
    if test -f "$file"      # only existing files - some links are broken
    then
        ls -l "$file"
    fi
done
- Check this works before proceeding. Something like:
$ tot1 george.html | tot2 | tot3
-rwxr-xr-x 1 mhumphrysdculab tutors 39139 Sep 17 2015 Bitmaps/ric.crop.2.jpg
-rwxr-xr-x 1 mhumphrysdculab tutors 339817 Sep 17 2015 07.Carlow.Stn/SA400069.lores.jpg
-rwxr-xr-x 1 mhumphrysdculab tutors 190968 Sep 17 2015 07.Carlow.Stn/SA400070.lores.adjust.jpg
- (Note we have removed the files that do not exist.)
Now we will fix tot3:
- Comment out the ls.
- To print just the file size, insert:

stat --printf="%s\n" "$file"

(One possible finished tot3 is sketched after the check below.)
- Check this works before proceeding. Something like:
$ tot1 george.html | tot2 | tot3
39139
339817
190968
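Putting the two changes together, one possible finished tot3 looks like this. Note that "$file" is quoted (safer if a name ever contains spaces), and stat --printf is the GNU stat syntax used above.

# tot3 - one possible finished version; reads file names on stdin.

while read file
do
    if test -f "$file"                 # skip broken links
    then
        # ls -l "$file"                # the earlier check, now commented out
        stat --printf="%s\n" "$file"   # print the size in bytes only
    fi
done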
For 100% - tot4
- Pipe the above to a further script, "tot4", which adds up the file sizes. It looks like this:

TOTAL=0
while read size
do
    (missing line)
done
echo "$TOTAL"
- The missing line uses Arithmetic in Shell to do: TOTAL = TOTAL + size.
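For reference, one common way to write that missing line is with shell arithmetic expansion (other forms, such as expr, also work):

TOTAL=$(( TOTAL + size ))      # TOTAL = TOTAL + size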
Test
Your finished script should work like this:
(Go into testsuite.)
$ cd Cashel
$ tot1 george.html | tot2 | tot3 | tot4
569924
(Go into testsuite.)
$ cd ORahilly
$ tot1 the.orahilly.note.html | tot2 | tot3 | tot4
2515442
Further work
- Not part of this test, but you could use Shell functions to combine the 4 scripts into one script.
- You could also change the script to get stats for every file in a wildcard (like *html) instead of just a single file. (A rough sketch of this is below.)
- Imagine using this script to search thousands of pages for the most overloaded pages.
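As a rough sketch of the wildcard idea (not part of the test), and assuming tot1 to tot4 are already in /home/exam/bin and on your PATH, something like this would print a total for every HTML file in the current directory, with the most overloaded pages last:

# Sketch only: total embedded-JPEG size of every HTML page in this directory.
for f in *html
do
    echo "$( tot1 "$f" | tot2 | tot3 | tot4 ) $f"
done | sort -n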