Dr. Mark Humphrys

School of Computing. Dublin City University.


Wikify a page

This can be done in about 50 lines of shell.

  1. Usage: wikify file.html > file.wiki.html
  2. "Wikifies" file.html and writes the result to stdout

  3. For every capitalised (i.e. possibly a proper noun) and un-linked word Word ...
    • See Parsing XML / HTML
    • Can find lines containing a capitalised word with grep '[A-Z][a-z]'; grep -o '[A-Z][a-z]*' prints just the words
    • Can extract all links with something like:
      cat file.xhtml | xpath '//a[@href]'

  4. ... Link the word to http://en.wikipedia.org/wiki/Word
  5. (We could check if that URL exists, but I don't want this class practical to cause trouble for Wikipedia's servers, so we will not check here.)
  6. Only link the first occurrence of Word, not subsequent occurrences.

  7. Q. How do you avoid Wikifying words inside a title or an existing link? For example:
    <title> Word Word Word </title>
    <a href=url> Word Word Word </a>

  8. Test on a sample page from the corpus of the works of Shakespeare.
  9. If you pick the same page as another student, I may get suspicious and compare your code.

  10. What to hand up (Note: show the HTML source before and after).
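
For step 3, a minimal sketch of collecting candidate words. The function name candidate_words is hypothetical, and the pattern assumes plain ASCII letters. It does not yet skip words that are already linked, and it will also pick up the capitalised tail of a mixed-case word.

```shell
#!/bin/sh
# candidate_words (hypothetical name): read text on stdin and print each
# capitalised word once, in order of first appearance.
candidate_words() {
    # LC_ALL=C makes [A-Z] mean exactly the 26 ASCII capitals.
    # grep -o prints each match on its own line.
    # The awk filter keeps only the first copy of each word.
    LC_ALL=C grep -o '[A-Z][a-z][a-z]*' | awk '!seen[$0]++'
}

# Example: prints Enter, Hamlet, Horatio (one per line).
printf 'Enter Hamlet and Horatio. Hamlet speaks.\n' | candidate_words
```

Words at the start of a sentence (like "Enter" above) are picked up too, since capitalisation alone cannot tell them apart from proper nouns.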

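
For steps 4 and 6, one possible way to link only the first occurrence of a word, using awk called from shell. The names link_first and find_word are illustrative and not part of the practical. The sketch does not deal with the question in step 7, so it would also link a word found inside a tag.

```shell
#!/bin/sh
# link_first WORD (hypothetical name): copy stdin to stdout, wrapping only
# the first whole-word occurrence of WORD in a link to its Wikipedia page.
link_first() {
    awk -v w="$1" '
        # Return the position of the first occurrence of w in s that is not
        # part of a longer alphabetic word, or 0 if there is none.
        function find_word(s, w,    off, i, before, after) {
            off = 0
            while ((i = index(substr(s, off + 1), w)) > 0) {
                i += off
                before = (i > 1) ? substr(s, i - 1, 1) : ""
                after = substr(s, i + length(w), 1)
                if (before !~ /[A-Za-z]/ && after !~ /[A-Za-z]/)
                    return i
                off = i
            }
            return 0
        }
        # Only the first matching line is edited; later lines pass through.
        !linked && (pos = find_word($0, w)) > 0 {
            link = "<a href=\"http://en.wikipedia.org/wiki/" w "\">" w "</a>"
            $0 = substr($0, 1, pos - 1) link substr($0, pos + length(w))
            linked = 1
        }
        { print }
    '
}

# Example: only the first "Hamlet" becomes a link.
printf 'Hamlet meets Hamlet.\n' | link_first Hamlet
```

Calling link_first once per candidate word, piping each result into the next call, would be a simple if slow way to process a whole page.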

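
For step 7, one partial approach is to hide titles, existing links and markup before searching for candidate words. The function name visible_text is hypothetical. The sketch assumes each element opens and closes on the same line and that links contain no nested tags, which will not hold for every page.

```shell
#!/bin/sh
# visible_text (hypothetical name): drop <title> elements, existing links,
# and any remaining tags, leaving only text in which new links could be made.
visible_text() {
    # 1st expression: remove a whole title element (either letter case).
    # 2nd expression: remove a whole link, including its text.
    # 3rd expression: remove any other tag, keeping the text around it.
    sed -e 's|<[Tt][Ii][Tt][Ll][Ee]>[^<]*</[Tt][Ii][Tt][Ll][Ee]>||g' \
        -e 's|<[Aa] [^>]*>[^<]*</[Aa]>||g' \
        -e 's|<[^>]*>||g'
}

# Example: "King Lear" and "Kent" disappear; "Enter" and "Gloucester" remain.
printf '<title>King Lear</title><p>Enter <a href=x>Kent</a> and Gloucester</p>\n' | visible_text
```

This only decides which words are candidates. The step that inserts the links must still take care not to edit text inside a tag.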
