Dr. Mark Humphrys

School of Computing. Dublin City University.


Find overloaded pages

Find web pages (among my local files on disk, not remote pages) that are overloaded with too many or too-large embedded images, i.e. slow-loading pages.


totalimg

Write this script:
totalimg file.html
It should add up the total size of all the embedded images in the HTML file.



Notes

Embedded images look like:
  <img src="filename">
  <img ....  src="filename">
  <img ....  src="filename" .... >
To test it we will run it on pages in my test suite:
cd /users/tutors/mhumphrysdculab/share/testsuite



Recipe


For 40%

  1. grep the file for lines containing an embedded image.
  2. Put newlines before and after every HTML tag, so that each tag ends up on its own line.
  3. grep again for embedded images.
  4. Use grep to get rid of lines containing 'http', so that only local images remain.
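
Put together, these steps might look something like this inside totalimg (a sketch only, assuming GNU sed, where \n in the replacement inserts a newline; any equivalent commands are fine):

    #!/bin/sh
    # totalimg - sketch of the 40% stage.
    # grep for image lines, split tags onto their own lines,
    # grep the img tags again, then drop remote (http) images.
    grep -i '<img' "$1" | sed 's/</\n</g; s/>/>\n/g' | grep -i '<img' | grep -v 'http'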

You should now just have a list of embedded local images, like this:


$ cd /users/tutors/mhumphrysdculab/share/testsuite/Cashel
$ totalimg george.html

<img border=0 src="../Icons/pdf.gif">
<img border=0 src="../Icons/pdf.gif">
<img src="Bitmaps/ric.crop.2.jpg">
<img src="../Icons/me.gif">
<img width="98%" src="../Kickham/08.Mullinahone/SA400010.small.JPG">
<img width="98%" src="../Kickham/08.Mullinahone/SA400028.small.JPG">
<img width="95%" src="07.Carlow.Stn/SA400069.lores.jpg">
<img border=1  width="95%" src="07.Carlow.Stn/SA400070.lores.adjust.jpg">


For 60%

Extract the image file names as follows.
  1. Use sed to delete everything from start-of-line to src="
  2. sed by default is case sensitive. So also delete everything up to SRC=" in case some are uppercase.
  3. Use sed to delete everything from " to end-of-line.
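
One way to write these deletions, continuing the 40% sketch above (again assuming GNU sed; equivalent sed commands are fine):

    #!/bin/sh
    # totalimg - sketch of the 60% stage:
    # the 40% pipeline, then strip each line down to just the file name.
    grep -i '<img' "$1" | sed 's/</\n</g; s/>/>\n/g' | grep -i '<img' | grep -v 'http' |
     sed -e 's/.*src="//' -e 's/.*SRC="//' -e 's/".*//'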
You should now have a better list of local images, like this:

$ totalimg george.html
 
../Icons/pdf.gif
../Icons/pdf.gif
Bitmaps/ric.crop.2.jpg
../Icons/me.gif
../Kickham/08.Mullinahone/SA400010.small.JPG
../Kickham/08.Mullinahone/SA400028.small.JPG
07.Carlow.Stn/SA400069.lores.jpg
07.Carlow.Stn/SA400070.lores.adjust.jpg


For 80%

  1. Pipe the previous output into a second script, total2, which will find the file sizes.
  2. One issue is that some of the testsuite pages have broken links. For each file, we need to test that it exists before finding its size.
  3. So we start off with total2 looking like this:

     
    
    while read file
    do
     if test -f "$file"     # skip files that do not exist (broken links)
     then
      ls -l "$file"
     fi
    done
    
    

  4. Check this works before proceeding. Something like:

     
    $ totalimg george.html
    
    -rwxr-xr-x 1 mhumphrysdculab tutors 426 Sep 17  2015 ../Icons/pdf.gif
    -rwxr-xr-x 1 mhumphrysdculab tutors 426 Sep 17  2015 ../Icons/pdf.gif
    -rwxr-xr-x 1 mhumphrysdculab tutors 39139 Sep 17  2015 Bitmaps/ric.crop.2.jpg
    -rwxr-xr-x 1 mhumphrysdculab tutors 1005 Sep 17  2015 ../Icons/me.gif
    -rwxr-xr-x 1 mhumphrysdculab tutors 339817 Sep 17  2015 07.Carlow.Stn/SA400069.lores.jpg
    -rwxr-xr-x 1 mhumphrysdculab tutors 190968 Sep 17  2015 07.Carlow.Stn/SA400070.lores.adjust.jpg
    
    

  5. (Note we have removed the files that do not exist.)
  6. Comment out the ls line.
  7. To print just the file size, insert:
    stat --printf="%s\n" "$file"
    (The \n is needed so that each size appears on its own line, as in the output below. A sketch of total2 at this stage is shown after this list.)
  8. Check this works before proceeding. Something like:

     
    $ totalimg george.html
     
    426
    426
    39139
    1005
    339817
    190968
    
    

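For reference, after steps 6 and 7 the total2 sketch might now look like this (one possible version):

    while read file
    do
     if test -f "$file"
     then
      # ls -l "$file"
      stat --printf="%s\n" "$file"    # print just the size, one per line
     fi
    done
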

For 100%

  1. Pipe the above to a 3rd script which looks like this:

      
    
     TOTAL=0
     while read size
     do
        (missing line)
     done
     echo "$TOTAL"
    
    

  2. The missing line uses Arithmetic in Shell to do:
    TOTAL = TOTAL + size.
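
As a reminder, shell arithmetic has this form (a generic example, not the missing line itself):

    COUNT=5
    COUNT=$(( COUNT + 3 ))
    echo "$COUNT"     # prints 8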


Test

Your finished script should work like this:

 
  
$ cd /users/tutors/mhumphrysdculab/share/testsuite/Cashel
$ totalimg george.html
571781

$ totalimg bushfield.html
3274461  

$ cd /users/tutors/mhumphrysdculab/share/testsuite/ORahilly
$ totalimg the.orahilly.note.html
2515730  

$ totalimg ballylongford.html
1654649  

Imagine using this script to search thousands of pages for the most overloaded pages.
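
A sketch of how such a search might look (hypothetical; it assumes totalimg is on your PATH and, as in the examples above, is run from each page's own directory so that relative image paths resolve):

    # List every HTML page under the current directory with its total image size,
    # heaviest pages first.
    find . -name '*.html' | while read -r page
    do
     dir=$(dirname "$page")
     file=$(basename "$page")
     size=$(cd "$dir" && totalimg "$file")
     echo "$size $page"
    done | sort -rn | head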


