Dr. Mark Humphrys

School of Computing. Dublin City University.



XML and HTML

Machine readable v. human readable content.




XML



XML example

Example XML file: test.ajax.xml

A pure data file (not a web page). It contains the US states and their two-letter codes.

 
<?xml version="1.0"?>
<choices xml:lang="EN">
        <item><label>Alabama</label><value>AL</value></item>
...
        <item><label>Wyoming</label><value>WY</value></item>
</choices>
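
Because this is pure XML, a standard tool can check that it is well-formed before any program tries to use it. A small sketch, assuming xmllint (part of libxml2) is installed:

        # check the file is well-formed XML (prints nothing if it is fine)
        xmllint --noout test.ajax.xml

        # pretty-print it with consistent indentation
        xmllint --format test.ajax.xml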





XML example - Microsoft Office XML


Part of a Word file in XML format (DOCX):


See explanation: Microsoft Office XML file formats.
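
A DOCX file is really a ZIP archive full of XML files, so you can look inside one from the command line. A sketch, assuming a file called report.docx (the filename is made up) and the standard unzip tool:

        # list the XML files packed inside the DOCX
        unzip -l report.docx

        # dump the start of the main document XML
        unzip -p report.docx word/document.xml | head -c 500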






XML example - Flickr XML feeds




Note how the feed invents its own tags:

...
<entry>
        <title>by Nathan Coley</title>
        <link rel="alternate" type="text/html" href="http://www.flickr.com/photos/jeremydp/3992143711/"/>
        <id>tag:flickr.com,2005:/photo/3992143711</id>
        <published>2009-10-08T12:00:32Z</published>
        <updated>2009-10-08T12:00:32Z</updated>
        <dc:date.Taken>2009-10-03T21:59:32-08:00</dc:date.Taken>
        <content type="html"> .... </content>
        <author>
                <name>jeremyDP</name>
                <uri>http://www.flickr.com/people/jeremydp/</uri>
        </author>
        <link rel="enclosure" type="image/jpeg" href="http://farm3.static.flickr.com/2479/3992143711_1353c6f932_m.jpg" />
        <category term="paris" scheme="http://www.flickr.com/photos/tags/" />
        <category term="2009" scheme="http://www.flickr.com/photos/tags/" />
        <category term="butteschaumont" scheme="http://www.flickr.com/photos/tags/" />
        <category term="nuitblanche" scheme="http://www.flickr.com/photos/tags/" />
        <category term="nathancoley" scheme="http://www.flickr.com/photos/tags/" />
</entry>
...
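
Because the feed mixes standard Atom tags with tags from other namespaces (e.g. dc:), a plain XPath query by tag name may match nothing. One workaround is to match on local-name(). A sketch, assuming the feed has been saved to a local file called feed.atom.xml (a made-up name) and xmllint is installed:

        # list the titles of the entries in the feed
        xmllint --xpath "//*[local-name()='title']/text()" feed.atom.xml

        # list the tags (categories) attached to the photos
        xmllint --xpath "//*[local-name()='category']/@term" feed.atom.xml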
  


Machine readable web






RSS (XML web feeds)
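
RSS is another XML format, used for feeds of articles or posts. A minimal RSS 2.0 feed looks roughly like this (a sketch using example.com, not a real feed):

<?xml version="1.0"?>
<rss version="2.0">
        <channel>
                <title>Example feed</title>
                <link>http://www.example.com/</link>
                <description>What this feed is about.</description>
                <item>
                        <title>First post</title>
                        <link>http://www.example.com/first.html</link>
                        <description>Summary of the first post.</description>
                </item>
        </channel>
</rss>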






Parsing XML / HTML

There is support in many programming languages for parsing XML / HTML.
The problem is that strict parsers may fail on badly-formed XML / HTML (i.e. a lot of the HTML on the web).


  1. Javascript


  2. jQuery


  3. Shell - Command-line tools that you can use in shell scripts
    • xpath query language - Parse well-formed XML
    • xpath command on Linux. I cannot find a man page!
    • Test with example XML file: test.ajax.xml

        # extract nodes delimited by <choices>
        cat test.ajax.xml | xpath //choices

        # extract nodes delimited by <item> within those
        cat test.ajax.xml | xpath //choices//item

        # get first node only
        cat test.ajax.xml | xpath "(//choices//item)[1]"
        cat test.ajax.xml | xpath "//item[1]"

        # get text inside tags
        cat test.ajax.xml | xpath "//item[1]" | xpath "//label[1]"
        cat test.ajax.xml | xpath "//item[1]" | xpath "//label[1]/text()" > outputfile
      
      


  4. Java - Parsing HTML in Swing


  5. List of HTML parsers
    • Many in this list are designed to be error-tolerant and able to parse the badly-formed HTML that is found "in the wild". See the ones with names like "soup" and "tidy".
    • Some are program libraries (e.g. Java libraries). Some are stand-alone command-line tools.
    • The "tag soup" concept.
    • The TagSoup Java library by John Cowan

Strategy for parsing HTML:
  1. Use an error-tolerant reader to convert badly-formed HTML to well-formed HTML.
  2. Then parse the well-formed HTML with other, more picky programs like xpath (see the sketch below).
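
As a concrete sketch of this strategy, assuming the HTML Tidy and xmllint tools are installed, and a messy local file called badpage.html (a made-up name):

        # step 1: let tidy repair the page and output well-formed XHTML
        tidy -q -asxml badpage.html > goodpage.xml

        # step 2: strict tools now work; XHTML puts everything in a default
        # namespace, so match on local-name()
        xmllint --xpath "//*[local-name()='a']/@href" goodpage.xml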



XHTML

As the web matured, the HTML standards people started to consider the problem of bad HTML.



Screenshot from XHTML ebook shows the utopian vision of XHTML.


  

Re-write the Web?

 


My first decent mobile Internet device: XDA Exec = HTC Universal (2005).
This had no problem rendering malformed HTML.




Strict, well-formed data is good (so long as not compulsory)

There is a good rule (Postel's law): "Be conservative in what you send, be liberal in what you accept". That is, for a new project, why not output strict, well-formed data (XHTML, or validated HTML)? It will make it easier for your team to re-purpose your content in the future.
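
One way to check your own pages from the command line is HTML Tidy (the W3C online validator is another option). A sketch, assuming a page called mypage.html (a made-up name):

        # report errors and warnings in the page, without rewriting it
        tidy -e -q mypage.html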

I am only pointing out that strictness cannot be expected of the entire web (hence "be liberal in what you accept"). One must also consider:

  1. Old pages.
  2. New pages written by people who do not conform to standards. (You might say "amateurs". Or you might say "people with other jobs".) There are millions of such pages and sites. There are new such pages and sites created every day. Consider even just the web pages of all computer lecturers at DCU. How many validate their HTML?

It is unlikely the web will ever be well-formed. And maybe it doesn't matter.



A 2009 post tries to validate the HTML of major websites, and suggests that the majority of the web is malformed. And it doesn't seem to matter.


  

HTML5 instead of XHTML



JSON instead of XML
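
For comparison, the US states data from the XML example above might look something like this in JSON. This is just a sketch of one possible encoding, not any official format:

{
        "choices": [
                { "label": "Alabama", "value": "AL" },
                ...
                { "label": "Wyoming", "value": "WY" }
        ]
}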



Human-readable web and machine-readable web stay separate


