Find overloaded pages
Find web pages (among my local files on disk, not remote pages) that are overloaded with too many, or too large, embedded images. These are the slow-loading pages.
4 scripts
We are going to write 4 scripts. When combined as follows, they will add up the total size of all embedded images (JPEGs only) in a given HTML file:
tot1 file.html | tot2 | tot3 | tot4
We will be testing the scripts independently.
Notes
Embedded images look like:
<img src="filename">
<img .... src="filename">
<img .... src="filename" .... >
To test the scripts, we will run them on pages in my test suite:
cd /users/tutors/mhumphrysdculab/share/testsuite
For 30% - tot1
- grep the file for lines containing an embedded image.
- Put newlines before and after every HTML tag.
- grep again for embedded images, so that each <img ...> tag is now on its own line.
- Use grep to get rid of lines containing 'http' (these are remote images, not local files).
- Use grep to keep JPEGs only.
- For simplicity, assume all JPEGs have the extension .JPG or .jpg.
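Putting these steps together, a minimal sketch of tot1 could look like this (assuming GNU sed, which interprets \n in the replacement text as a newline; any pipeline producing the same output is equally valid):

#!/bin/sh
# tot1 - sketch: list the embedded local JPEGs in the HTML file "$1"

grep -i '<img' "$1" |            # lines containing an embedded image
sed 's/</\n</g; s/>/>\n/g' |     # newlines before and after every tag
grep -i '<img' |                 # each <img ...> tag now on its own line
grep -v 'http' |                 # remove remote images
grep -i '\.jpg'                  # JPEGs only, of any case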
You should now just have a list of embedded local images, JPEGs only, like this:
$ cd /users/tutors/mhumphrysdculab/share/testsuite/Cashel
$ tot1 george.html
<img src="Bitmaps/ric.crop.2.jpg">
<img width="98%" src="../Kickham/08.Mullinahone/SA400010.small.JPG">
<img width="98%" src="../Kickham/08.Mullinahone/SA400028.small.JPG">
<img width="95%" src="07.Carlow.Stn/SA400069.lores.jpg">
<img border=1 width="95%" src="07.Carlow.Stn/SA400070.lores.adjust.jpg">
Note it finds JPEGs of any case.
It does not return non-JPEG images.
For 50% - tot2
You will pipe the above output into a second script, "tot2", which will extract the image file names as follows.
- Use sed to delete everything from start-of-line to src="
- sed is case sensitive by default. So also delete everything up to SRC=", in case some tags are uppercase.
- Use sed to delete everything from " to end-of-line.
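A minimal sketch of tot2, reading that output on standard input (the sed commands shown are just one possibility):

#!/bin/sh
# tot2 - sketch: reduce each <img ...> tag to the bare filename

sed -e 's/^.*src="//' \
    -e 's/^.*SRC="//' \
    -e 's/".*$//'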
You should now have a better list of local images, like this:
$ tot1 george.html | tot2
Bitmaps/ric.crop.2.jpg
../Kickham/08.Mullinahone/SA400010.small.JPG
../Kickham/08.Mullinahone/SA400028.small.JPG
07.Carlow.Stn/SA400069.lores.jpg
07.Carlow.Stn/SA400070.lores.adjust.jpg
For 70% - tot3
You will pipe the above into a further script, "tot3", which will look up the file sizes.
- One issue is that some of the testsuite pages have broken links, so for each file we need to test that it exists before finding its file size.
- So we start off with tot3 looking like this:
while read file
do
    if test -f "$file"      # some links are broken: skip files that do not exist
    then
        ls -l "$file"
    fi
done
- Check this works before proceeding. Something like:
$ tot1 george.html | tot2 | tot3
-rwxr-xr-x 1 mhumphrysdculab tutors 39139 Sep 17 2015 Bitmaps/ric.crop.2.jpg
-rwxr-xr-x 1 mhumphrysdculab tutors 339817 Sep 17 2015 07.Carlow.Stn/SA400069.lores.jpg
-rwxr-xr-x 1 mhumphrysdculab tutors 190968 Sep 17 2015 07.Carlow.Stn/SA400070.lores.adjust.jpg
- (Note we have removed the files that do not exist.)
Now we will fix tot3:
- Comment out the ls.
- To print just the file size, insert something like:
stat --printf="%s" "$file"
- (I left something out: you need a newline after each file size. You figure out how to do that.)
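For reference, one possible sketch of the revised loop (assuming GNU stat, whose --printf option interprets escape sequences such as \n in the format string, which is one way to supply that newline):

while read file
do
    if test -f "$file"
    then
        # ls -l "$file"
        stat --printf="%s\n" "$file"    # file size in bytes, one per line
    fi
done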
- Check this works before proceeding. Something like:
$ tot1 george.html | tot2 | tot3
39139
339817
190968
For 100% - tot4
- Pipe the above into a further script, "tot4", which adds up the file sizes. It looks like this:
TOTAL=0
while read size
do
(missing line)
done
echo "$TOTAL"
- The missing line uses Arithmetic in Shell to do: TOTAL = TOTAL + size.
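For reference, arithmetic in the shell uses the $(( )) syntax. A generic example with hypothetical variables (not the exact missing line):

COUNT=$((COUNT + n))    # sets COUNT to the old COUNT plus the value of n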
Test
Your finished script should work like this:
$ cd /users/tutors/mhumphrysdculab/share/testsuite/Cashel
$ tot1 george.html | tot2 | tot3 | tot4
569924
$ cd /users/tutors/mhumphrysdculab/share/testsuite/ORahilly
$ tot1 the.orahilly.note.html | tot2 | tot3 | tot4
2515442
Further work
- Not part of this test, but you could use
Shell functions
to combine the 4 scripts into one script.
- You could also change the script to
get stats for every file in a wildcard (like *html)
instead of just a single file.
- Imagine using this script to search thousands of pages for the most overloaded pages.
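As a sketch of that idea, a hypothetical wrapper (assumes tot1 to tot4 are executable and on your PATH, and that you run it in the directory containing the pages, since the image paths are relative):

#!/bin/sh
# report the total embedded-JPEG size of every matching page, biggest first

for page in *html
do
    total=$(tot1 "$page" | tot2 | tot3 | tot4)
    echo "$total $page"
done | sort -rn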