Wikify a page
Can be done in 50 lines of shell or so.
- Usage: wikify file.html > file.wiki.html
- "Wikifies" file.html, output to stdout
- For every capitalised (i.e. possibly a proper noun) and not yet linked word Word ...
- See Parsing XML / HTML
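- Can find capitalised words with something like:
grep -ow '[A-Z][a-z][a-z]*'
(Plain grep '[A-Z][a-z]' prints whole matching lines. -o prints each match on its own line and -w matches whole words only; both are common extensions, e.g. in GNU grep.)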
- Can extract all links with something like:
cat file.xhtml | xpath '//a[@href]'
- ... Link the word to
http://en.wikipedia.org/wiki/Word
- (We could check if that URL exists,
but I don't want this class practical
to cause trouble for Wikipedia's servers,
so we will not check here.)
- Only link the first occurrence of Word, not subsequent occurrences.
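The linking step above (first occurrence only) could be sketched as follows. This is a minimal illustration, not the intended solution: link_first is a hypothetical helper name, it uses awk rather than pure shell, and it assumes the word consists only of ASCII letters. It makes no attempt to skip words inside tags.

```shell
#!/bin/sh
# link_first WORD [FILE...]: hypothetical helper, not part of the practical.
# Wraps the first whole-word occurrence of WORD in a Wikipedia link and
# copies everything else through unchanged. Reads stdin if no FILE is given.
link_first() {
    word=$1
    shift
    awk -v w="$word" '
        !done {
            rest = $0; out = ""
            # Walk through each occurrence of w on this line.
            while ((i = index(rest, w)) > 0) {
                before = substr(rest, 1, i - 1)
                after  = substr(rest, i + length(w))
                prev = substr(before, length(before), 1)
                nxt  = substr(after, 1, 1)
                # Accept only if no letter touches w on either side.
                if (prev !~ /[A-Za-z]/ && nxt !~ /[A-Za-z]/) {
                    $0 = out before "<a href=\"http://en.wikipedia.org/wiki/" w "\">" w "</a>" after
                    done = 1
                    break
                }
                out = out before w
                rest = after
            }
        }
        { print }
    ' "$@"
}
```

For example, `link_first Hamlet file.html > file.wiki.html` would link the first standalone "Hamlet" and leave later occurrences (and "Hamlets") alone.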
- Q. How do you avoid Wikifying words inside tags:
<title> Word Word Word </title>
<a href=url> Word Word Word </a>
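One possible answer to the question above is to split the page so that every tag sits on a line of its own, then keep only text lines seen while not inside <a> or <title>. The sketch below assumes each tag opens and closes on a single input line and that attribute values contain no stray < or >; linkable_text is a hypothetical helper name, and awk stands in for pure shell.

```shell
#!/bin/sh
# linkable_text [FILE...]: hypothetical helper, not part of the practical.
# Prints only the text that lies outside every tag and outside
# <a>...</a> and <title>...</title> elements.
# Pass 1 puts every tag on a line of its own.
# Pass 2 drops tag lines, and drops text while inside <a> or <title>.
linkable_text() {
    awk '{ gsub(/</, "\n<"); gsub(/>/, ">\n"); print }' "$@" |
    awk '
        { t = tolower($0) }
        t ~ /^<(a|title)[ >]/ { skip++; next }
        t ~ /^<\/(a|title)>/  { if (skip > 0) skip--; next }
        t ~ /^</              { next }
        skip == 0 && NF       { print }
    '
}
```

Piping the result into the capitalised-word grep, e.g. `linkable_text file.html | grep -ow '[A-Z][a-z][a-z]*' | sort -u`, would then list candidate words without picking up anything from titles, existing links, or attribute values.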
- Test on a sample page from the corpus of the works of Shakespeare.
- If you pick the same page as another student, I may get suspicious
and compare your code.
- What to hand up
(Note: show the HTML source before and after.)