Find overloaded pages
Find web pages (among my local files on disk, not remote pages) that are overloaded with too many, or too large, embedded images. These are the slow-loading pages.
4 scripts
We are going to write 4 scripts. When combined as follows, they will add up the total size of all embedded images (JPEGs only) in a given HTML file:
tot1 file.html | tot2 | tot3 | tot4
We will be testing the scripts independently.
Notes
Embedded images look like:
<img src="filename">
<img .... src="filename">
<img .... src="filename" .... >
To test the scripts, we will run them on pages in my test suite:
cd /users/tutors/mhumphrysdculab/share/testsuite
For 30% - tot1
- grep the file for lines containing an embedded image.
- Put newlines before and after every HTML tag.
- grep again for embedded images, so that each <img ...> tag is now on its own line.
- Use grep to get rid of lines containing 'http' (these are remote images, not local files).
- Use grep to keep JPEGs only.
- For simplicity, assume all JPEGs have the extension .JPG or .jpg.
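Putting these steps together, a minimal sketch of tot1 could look like this (assuming GNU sed, which interprets \n in the replacement text as a newline; any pipeline producing the same output is equally valid):

#!/bin/sh
# tot1 - sketch: list the embedded local JPEGs in the HTML file "$1"

grep -i '<img' "$1" |            # lines containing an embedded image
sed 's/</\n</g; s/>/>\n/g' |     # newlines before and after every tag
grep -i '<img' |                 # each <img ...> tag now on its own line
grep -v 'http' |                 # remove remote images
grep -i '\.jpg'                  # JPEGs only, of any case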
You should now just have a list of embedded local images, JPEGs only, like this:
$ cd /users/tutors/mhumphrysdculab/share/testsuite/Cashel
$ tot1 george.html
<img src="Bitmaps/ric.crop.2.jpg">
<img width="98%" src="../Kickham/08.Mullinahone/SA400010.small.JPG">
<img width="98%" src="../Kickham/08.Mullinahone/SA400028.small.JPG">
<img width="95%" src="07.Carlow.Stn/SA400069.lores.jpg">
<img border=1 width="95%" src="07.Carlow.Stn/SA400070.lores.adjust.jpg">
Note it finds JPEGs of any case.
It does not return non-JPEG images.
For 50% - tot2
You will pipe the above output into a second script, "tot2", which will extract the image file names as follows.
- Use sed to delete everything from start-of-line to src="
- sed is case sensitive by default. So also delete everything up to SRC=", in case some tags are uppercase.
- Use sed to delete everything from " to end-of-line.
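A minimal sketch of tot2, reading that output on standard input (the sed commands shown are just one possibility):

#!/bin/sh
# tot2 - sketch: reduce each <img ...> tag to the bare filename

sed -e 's/^.*src="//' \
    -e 's/^.*SRC="//' \
    -e 's/".*$//'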
You should now have a better list of local images, like this:
$ tot1 george.html | tot2
Bitmaps/ric.crop.2.jpg
../Kickham/08.Mullinahone/SA400010.small.JPG
../Kickham/08.Mullinahone/SA400028.small.JPG
07.Carlow.Stn/SA400069.lores.jpg
07.Carlow.Stn/SA400070.lores.adjust.jpg
For 70% - tot3
You will pipe the above into a further script, "tot3", which will look up the file sizes.
- One issue is that some of the testsuite pages have broken links, so for each file we need to test that it exists before finding its file size.
- So we start off with tot3 looking like this:
while read file
do
    if test -f "$file"      # some links are broken: skip files that do not exist
    then
        ls -l "$file"
    fi
done
- Check this works before proceeding. Something like:
$ tot1 george.html | tot2 | tot3
-rwxr-xr-x 1 mhumphrysdculab tutors 39139 Sep 17 2015 Bitmaps/ric.crop.2.jpg
-rwxr-xr-x 1 mhumphrysdculab tutors 339817 Sep 17 2015 07.Carlow.Stn/SA400069.lores.jpg
-rwxr-xr-x 1 mhumphrysdculab tutors 190968 Sep 17 2015 07.Carlow.Stn/SA400070.lores.adjust.jpg
- (Note we have removed the files that do not exist.)
Now we will fix tot3:
- Comment out the ls.
- To print just the file size, insert something like:
stat --printf="%s" "$file"
- (I left something out: you need a newline after each file size. You figure out how to do that.)
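For reference, one possible sketch of the revised loop (assuming GNU stat, whose --printf option interprets escape sequences such as \n in the format string, which is one way to supply that newline):

while read file
do
    if test -f "$file"
    then
        # ls -l "$file"
        stat --printf="%s\n" "$file"    # file size in bytes, one per line
    fi
done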
- Check this works before proceeding. Something like:
$ tot1 george.html | tot2 | tot3
39139
339817
190968
For 100% - tot4
- Pipe the above into a further script, "tot4", which adds up the file sizes. It looks like this:
TOTAL=0
while read size
do
(missing line)
done
echo "$TOTAL"
- The missing line uses Arithmetic in Shell to do: TOTAL = TOTAL + size.
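For reference, arithmetic in the shell uses the $(( )) syntax. A generic example with hypothetical variables (not the exact missing line):

COUNT=$((COUNT + n))    # sets COUNT to the old COUNT plus the value of n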
Test
Your finished script should work like this:
$ cd /users/tutors/mhumphrysdculab/share/testsuite/Cashel
$ tot1 george.html | tot2 | tot3 | tot4
569924
$ cd /users/tutors/mhumphrysdculab/share/testsuite/ORahilly
$ tot1 the.orahilly.note.html | tot2 | tot3 | tot4
2515442
Further work
- Not part of this test, but you could use
Shell functions
to combine the 4 scripts into one script.
- You could also change the script to
get stats for every file in a wildcard (like *html)
instead of just a single file.
- Imagine using this script to search thousands of pages for the most overloaded pages.
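As a sketch of that idea, a hypothetical wrapper (assumes tot1 to tot4 are executable and on your PATH, and that you run it in the directory containing the pages, since the image paths are relative):

#!/bin/sh
# report the total embedded-JPEG size of every matching page, biggest first

for page in *html
do
    total=$(tot1 "$page" | tot2 | tot3 | tot4)
    echo "$total $page"
done | sort -rn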