by Jon Jermey. First published in Online Currents – Vol.16 Issue 9, November 2001
XML is a language related to HTML that allows material to be marked up semantically rather than for its desired appearance or role in a document. XML provides an open standard for compiling and accessing diverse data collections on a world-wide basis. Explicit specifications can be provided to control and validate XML data structures. XML also includes tools for converting between different sets of specifications.
Anyone working in computing knows that there is always a Next Big Thing – a development looming on the horizon that will change the world and lead us all into the Promised Land. Some NBTs fizzle out altogether, some end up doing a minor but respectable job in a niche somewhere, and a few actually make it. Extensible Markup Language (XML) is currently being proposed as the Next Big Thing for database access on the Internet. What is it all about, and does it really have the potential claimed for it?
History of XML
XML, like HTML, is a spin-off from the large and complex Standard Generalised Markup Language (SGML), ratified as ISO 8879 in 1986. The XML Working Group removed some of the less-used features of SGML and worked towards making a relatively simple language that could be used by a wide variety of applications over the Internet. An important goal was making XML documents easy for human beings to create and read. Usability comes at a cost: XML is not terse or compact, and a large XML database will take up considerably more space than its equivalent in, say, Microsoft Access. The complete XML 1.0 Recommendation1 and an Annotated XML Recommendation2 can both be found on the Web.
How it Works
Like HTML, XML consists of tags that are placed in documents – typically pages on the Web. In its simplest form it can be used to identify sections (‘elements’) of the text semantically: for instance, a section of a book review might be marked up as follows:
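My Life in the Desert by <AUTHOR>Lady <FIRSTNAME>Sandy</FIRSTNAME> <LASTNAME>Dewnes</LASTNAME> OBE</AUTHOR>
reviewed by <REVIEWER><FIRSTNAME>Jon</FIRSTNAME> <LASTNAME>Jermey</LASTNAME></REVIEWER>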
As shown above, tags can be nested; here the AUTHOR and REVIEWER elements each contain two tags, for FIRSTNAME and LASTNAME.
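As a rough illustration of how a program might read nested markup of this kind, the following Python sketch parses a version of the review with the standard-library xml.etree.ElementTree module. The wrapping REVIEW element and the exact placement of the tags are assumptions made for the example; only AUTHOR, REVIEWER, FIRSTNAME and LASTNAME come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: REVIEW is an assumed wrapper element, added so the
# fragment has a single root. The other tag names follow the article.
REVIEW_XML = """
<REVIEW>My Life in the Desert by
  <AUTHOR>Lady <FIRSTNAME>Sandy</FIRSTNAME> <LASTNAME>Dewnes</LASTNAME> OBE</AUTHOR>
  reviewed by
  <REVIEWER><FIRSTNAME>Jon</FIRSTNAME> <LASTNAME>Jermey</LASTNAME></REVIEWER>
</REVIEW>
"""


def full_name(person):
    """Join the FIRSTNAME and LASTNAME children of an AUTHOR or REVIEWER."""
    return f"{person.findtext('FIRSTNAME')} {person.findtext('LASTNAME')}"


root = ET.fromstring(REVIEW_XML)
author_name = full_name(root.find("AUTHOR"))
reviewer_name = full_name(root.find("REVIEWER"))
print(author_name)
print(reviewer_name)
```

Because the names sit in their own child elements, the program can pick them out without having to guess where 'Lady' or 'OBE' begin and end.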
Tags can also contain attributes. In the example above, it’s reasonable to expect that all our authors and reviewers will have first names and last names. By making FIRSTNAME and LASTNAME into attributes of the AUTHOR and REVIEWER tags, it becomes possible to search for authors’ names and reviewers’ names independently:
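My Life in the Desert by <AUTHOR FIRSTNAME="Sandra" LASTNAME="Dewnes">Lady Sandy Dewnes OBE</AUTHOR>
reviewed by <REVIEWER FIRSTNAME="Jon" LASTNAME="Jermey">Jon Jermey, AusSI Webmaster</REVIEWER>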
This also allows for more flexibility in the text on the page while maintaining control over the terms that are available for searching; e.g. ‘Sandy’ can appear on the page while ‘Sandra’ is the specified search term.
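The attribute approach can be sketched in Python as follows. This is only an illustration under stated assumptions: the REVIEW wrapper element and the find_people helper are hypothetical, and the attribute values simply follow the 'Sandy' versus 'Sandra' case described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: REVIEW is an assumed wrapper element. The attributes
# hold the controlled search terms; the element text is what appears on the page.
REVIEW_XML = """
<REVIEW>My Life in the Desert by
  <AUTHOR FIRSTNAME="Sandra" LASTNAME="Dewnes">Lady Sandy Dewnes OBE</AUTHOR>
  reviewed by
  <REVIEWER FIRSTNAME="Jon" LASTNAME="Jermey">Jon Jermey, AusSI Webmaster</REVIEWER>
</REVIEW>
"""


def find_people(tree, tag, firstname):
    """Return the displayed text of every <tag> whose FIRSTNAME attribute matches."""
    return [el.text for el in tree.iter(tag) if el.get("FIRSTNAME") == firstname]


root = ET.fromstring(REVIEW_XML)
authors_named_sandra = find_people(root, "AUTHOR", "Sandra")
reviewers_named_sandra = find_people(root, "REVIEWER", "Sandra")
print(authors_named_sandra)
print(reviewers_named_sandra)
```

The search for 'Sandra' finds the author even though the page shows 'Sandy', and the same search restricted to reviewers finds nothing, which is the independence the attributes are meant to provide.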
Because Web browsers ignore tags that don’t form part of valid HTML, XML markup is invisible to the ordinary user. Thus a Web page containing the text above will appear as:
My Life in the Desert by Lady Sandy Dewnes OBE reviewed by Jon Jermey, AusSI Webmaster
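The effect can be mimicked in a few lines of Python. This is not how a browser works internally; it is only a sketch that discards the tags and keeps the text between them, using an assumed REVIEW wrapper element around the marked-up sentence.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for the review sentence; REVIEW is an assumed wrapper.
MARKED_UP = (
    '<REVIEW>My Life in the Desert by '
    '<AUTHOR FIRSTNAME="Sandra" LASTNAME="Dewnes">Lady Sandy Dewnes OBE</AUTHOR> '
    'reviewed by '
    '<REVIEWER FIRSTNAME="Jon" LASTNAME="Jermey">Jon Jermey, AusSI Webmaster</REVIEWER>'
    '</REVIEW>'
)

root = ET.fromstring(MARKED_UP)
# itertext() walks the tree in document order and yields only the text,
# so tags and attribute values never reach the output.
visible_text = "".join(root.itertext())
print(visible_text)
```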
Tags and attributes can be created at the whim of the author; there is no limitation in XML on which tags or attributes can be used. For this reason XML is not really a language but a grammar from which many different languages can be developed.
XML Database Pages
Most advocates of XML plan for more than this simple embedding into Web pages; they are thinking in terms of online databases assembled from XML elements. Because of XML’s nesting capabilities, an elaborate hierarchy of elements is possible. Here, for instance, is a structure that can run from state level down to individual citizens:
22 Greave St
With this structure established and consistently applied it becomes relatively easy to, say, search for all 2-car households with an annual income of less than $40,000 in cities with a population between 1 and 2 million.
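A search of this kind might look something like the Python sketch below. Every element name, attribute name and data value in it is hypothetical, apart from the address 22 Greave St; it is meant only to show how a consistent hierarchy turns the query into a simple walk through the tree.

```python
import xml.etree.ElementTree as ET

# Hypothetical hierarchy: STATE > CITY > HOUSEHOLD > CITIZEN.
# None of these names or figures comes from a real data set.
CENSUS_XML = """
<STATE NAME="Example State">
  <CITY NAME="Exampleville" POPULATION="1500000">
    <HOUSEHOLD ADDRESS="22 Greave St" CARS="2" INCOME="35000">
      <CITIZEN NAME="A. Resident"/>
    </HOUSEHOLD>
    <HOUSEHOLD ADDRESS="24 Greave St" CARS="2" INCOME="60000"/>
  </CITY>
  <CITY NAME="Smalltown" POPULATION="40000">
    <HOUSEHOLD ADDRESS="1 High St" CARS="2" INCOME="30000"/>
  </CITY>
</STATE>
"""


def matching_households(state):
    """Addresses of 2-car households earning under $40,000 in cities of 1-2 million."""
    hits = []
    for city in state.iter("CITY"):
        if not 1_000_000 <= int(city.get("POPULATION")) <= 2_000_000:
            continue  # city is outside the population range
        for house in city.iter("HOUSEHOLD"):
            if int(house.get("CARS")) == 2 and int(house.get("INCOME")) < 40_000:
                hits.append(house.get("ADDRESS"))
    return hits


state = ET.fromstring(CENSUS_XML)
addresses = matching_households(state)
print(addresses)
```

Only the first household qualifies: the second earns too much, and the third is in a city that is too small.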
XML Validation and Conversion
XML allows authors to create tags and attribute names as they go, but this can lead to serious inconsistencies when large data sets are being assembled over time. To control the XML used in any particular document, that document can be associated with a schema that can either be embedded in the document or stored as a separate file.
A schema that specifies the data permitted in an XML document is called a Document Type Definition, or DTD. Any XML document can be validated against a DTD to ensure it meets the conditions for that particular data set. DTDs determine such things as:
- Whether a particular element can or must contain text: e.g. an element holding the text John Smith
- Whether a particular element can or must contain other elements, and if so how many and which: e.g. an element must contain one and only one of a given child element, but may contain up to three of another
- When a particular attribute must come from a predefined list – e.g. possible values for GENDER may be limited to “Male” or “Female”
- ‘Variables’ on the page that are displayed as values defined in the DTD: these are called entities, and allow for quick global substitutions at the time the XML document is displayed in a browser or otherwise opened. Thus “John Howard” could be substituted for the entity primeminister when a document is accessed.
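The sketch below shows what a small DTD embedded in a document could look like, covering the kinds of rule listed above. The element names PERSON, NAME and LEADER and the sample data are hypothetical. One caveat: the parser in Python's standard library reads the embedded DTD and expands its entities but does not validate against it, so the check on GENDER is written out by hand here; a validating parser would be needed to enforce the DTD itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an embedded DTD (the part inside [ ... ]).
DOCUMENT = """<!DOCTYPE PERSON [
  <!ELEMENT PERSON (NAME, LEADER)>
  <!ELEMENT NAME (#PCDATA)>
  <!ELEMENT LEADER (#PCDATA)>
  <!ATTLIST PERSON GENDER (Male | Female) #REQUIRED>
  <!ENTITY primeminister "John Howard">
]>
<PERSON GENDER="Female">
  <NAME>Jane Citizen</NAME>
  <LEADER>&primeminister;</LEADER>
</PERSON>
"""

# Mirrors the ATTLIST declaration by hand, because this parser
# does not enforce the DTD's rules.
ALLOWED_GENDERS = {"Male", "Female"}

person = ET.fromstring(DOCUMENT)
leader = person.findtext("LEADER")  # the entity is expanded while parsing
gender_ok = person.get("GENDER") in ALLOWED_GENDERS
print(leader, gender_ok)
```

Changing the single ENTITY line would change the name wherever &primeminister; appears, which is the global substitution described in the last point above.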
Several international bodies have adopted DTDs to control and validate information they put online; these include the Scalable Vector Graphics group3 at the World Wide Web Consortium (W3C), the Biztalk group maintained by Microsoft for exchanging business-related information4, and the Health Level 7 XML group5.
Both Microsoft Internet Explorer V5.5 and the Microsoft XML Notepad support validation of XML documents, although the methods they employ are not particularly useful or intuitive.
There are several commercial systems available for producing XML documents, but an easy way to get started is with the free XML Notepad program from Microsoft6. This displays raw XML for entry and editing with the data laid out in a tabular format, and produces an ASCII output file with the extension .XML. The XML Notepad validates files against their DTDs, if these exist, upon opening, but unfortunately not on saving, so that it is quite possible to save a file with errors and then be unable to re-open it for correction (this then has to be done manually with a text editor). Other free systems, including a beta release from IBM, proved difficult to install and unreliable in use.
Figure 1: An XML Document in Microsoft XML Notepad
Microsoft Internet Explorer 5.5 contains an XML viewer and validator. Opening a valid XML document displays a nested list of tags and properties, as shown below. These are expandable: elements containing child elements can be clicked to ‘unroll’ or ‘roll up’ the child elements appearing below them, in a similar way to the Outline view in Word. Attempts to open an invalid XML document will result in an error message indicating the location of the invalid section.
Figure 2: An XML Document in Microsoft Internet Explorer
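The same kind of error report can be obtained programmatically. The Python sketch below is an illustration rather than a description of what Internet Explorer does: it checks only that a document is well formed (tags properly opened, closed and nested), not that it conforms to a DTD, and the check_well_formed helper and sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical broken document: AUTHOR is closed with the wrong tag on line 2.
BROKEN = "<REVIEW>\n  <AUTHOR>Sandy Dewnes</REVIEWER>\n</REVIEW>"
GOOD = "<REVIEW><AUTHOR>Sandy Dewnes</AUTHOR></REVIEW>"


def check_well_formed(text):
    """Return None if text parses, else the (line, column) of the first error."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        return err.position
    return None


error_at = check_well_formed(BROKEN)
ok = check_well_formed(GOOD)
print(error_at, ok)
```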
XML comes with its own XSL style sheets which can be used to control the appearance of the data. A dramatic example can be downloaded from the Microsoft site7, where the results of an auction are shown as an XML file that can be viewed in four different ways through applying different style sheets.
Searching Using XML
Demos of XML-based searching can be found on the Web, usually associated with commercial systems. Unfortunately, sites that offer XML searching don’t always make it clear what role XML plays in the database or the search system, and without access to the original documents it is difficult to evaluate the success of the search. Go XML8 is a user-friendly example offering demonstration searches of Shakespeare’s plays and the Canadian airline flight schedule, where XML is used to define a category for searching. A list of XML search tools can be found at Searchtools.com9.
1 XML 1.0 Recommendation: http://www.w3.org/TR/REC-xml
2 Annotated XML Recommendation – explanations in plain English as to why certain syntax was used and particular approaches were taken: http://www.xml.com/axml/testaxml.htm
3 The Scalable Vector Graphics group (http://www.w3.org/Graphics/SVG) has developed a consistent way to transmit and represent vector-based graphics through a browser using XML. A free browser add-in to display these is available from Adobe (http://www.adobe.com)
4 This includes financial figures for economic comparisons – http://www.biztalk.org
5 For maintaining health and medical databases – http://www.hl7.org
6 Search Downloads at http://www.microsoft.com
9 http://www.searchtools.com . This site also provides links to other search tools for comparison.