XML And The Future Web

  

The extraordinary growth of the World Wide Web has been fueled by the ability it gives authors to easily and cheaply distribute electronic documents to an international audience. As Web documents have become larger and more complex, however, Web content providers have begun to experience the limitations of a medium that does not provide the extensibility, structure, and data checking needed for large-scale commercial publishing. The ability of Java applets to embed powerful data manipulation capabilities in Web clients makes even clearer the limitations of current methods for the transmittal of document data.

To address the requirements of commercial Web publishing and enable the further expansion of Web technology into new domains of distributed document processing, the World Wide Web Consortium has developed an Extensible Markup Language (XML) for applications that require functionality beyond the current Hypertext Markup Language (HTML).

 

Background: HTML and SGML

 

Most documents on the Web are stored and transmitted in HTML. HTML is a simple language well suited for hypertext, multimedia, and the display of small and reasonably simple documents. HTML is based on SGML (Standard Generalized Markup Language, ISO 8879), a standard system for defining and using document formats.

SGML allows documents to describe their own grammar -- that is, to specify the tag set used in the document and the structural relationships that those tags represent. HTML applications hardwire a small set of tags in conformance with a single SGML specification. Freezing a small set of tags allows users to leave the language specification out of the document and makes it much easier to build applications, but this ease comes at the cost of severely limiting HTML in three important respects: extensibility, structure, and validation.

  • Extensibility. HTML does not allow users to specify their own tags or attributes in order to parameterise or otherwise semantically qualify their data.
  • Structure. HTML does not support the specification of deep structures needed to represent database schemas or object-oriented hierarchies.
  • Validation. HTML does not support the kind of language specification that allows consuming applications to check data for structural validity on importation.
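
The validation point can be made concrete with a minimal sketch of what a consuming application might do when it does have a grammar to check against. Everything here is an assumption made for illustration: the element names (order, customer, item), the GRAMMAR table, and the structural_errors helper are hypothetical, and the check is far simpler than a real SGML or XML validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical grammar: each element name maps to the exact sequence of
# child element names it must contain (an empty list means "text only").
GRAMMAR = {
    "order": ["customer", "item"],
    "customer": [],
    "item": [],
}

def structural_errors(element):
    """Return a list of messages describing where the tree departs from GRAMMAR."""
    errors = []
    expected = GRAMMAR.get(element.tag)
    actual = [child.tag for child in element]
    if expected is None:
        errors.append(f"unknown element <{element.tag}>")
    elif actual != expected:
        errors.append(f"<{element.tag}> has children {actual}, expected {expected}")
    for child in element:
        errors.extend(structural_errors(child))
    return errors

good = ET.fromstring("<order><customer>Acme</customer><item>bolt</item></order>")
bad = ET.fromstring("<order><item>bolt</item><note>rush</note></order>")

print(structural_errors(good))  # an empty list: the data is structurally valid
for message in structural_errors(bad):
    print(message)
```

With a fixed tag set and no document-level grammar, an HTML consumer has nothing comparable to check imported data against.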

In contrast to HTML stands generic SGML. A generic SGML application is one that supports SGML language specifications of arbitrary complexity and makes possible the qualities of extensibility, structure, and validation missing from HTML. SGML makes it possible to define your own formats for your own documents, to handle large and complex documents, and to manage large information repositories. However, full SGML contains many optional features that are not needed for Web applications and has proven to have a cost/benefit ratio unattractive to current vendors of Web browsers.

 

The XML effort

 

The World Wide Web Consortium (W3C) has created an SGML Working Group to build a set of specifications to make it easy and straightforward to use the beneficial features of SGML on the Web. The goal of the W3C SGML activity is to enable the delivery of self-describing data structures of arbitrary depth and complexity to applications that require such structures.

The first phase of this effort is the specification of a simplified subset of SGML specially designed for Web applications. This subset, called XML (Extensible Markup Language), retains the key SGML advantages of extensibility, structure, and validation in a language that is designed to be vastly easier to learn, use, and implement than full SGML.

XML differs from HTML in three major respects:

  1. Information providers can define new tag and attribute names at will.
  2. Document structures can be nested to any level of complexity.
  3. Any XML document can contain an optional description of its grammar for use by applications that need to perform structural validation.
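
As a rough illustration of these three points, the sketch below parses a small document with Python's standard xml.etree.ElementTree module. The tag and attribute names (catalog, part, dimensions, id) are invented for this example, and the grammar is written as a document type declaration in the document itself. Note that this standard-library parser reads the declaration but does not validate against it; structural validation would need a validating parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the provider has chosen its own tag and attribute
# names, nested them several levels deep, and included an optional grammar.
DOCUMENT = """<!DOCTYPE catalog [
  <!ELEMENT catalog (part+)>
  <!ELEMENT part (name, dimensions)>
  <!ATTLIST part id CDATA #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT dimensions (width, height)>
  <!ELEMENT width (#PCDATA)>
  <!ELEMENT height (#PCDATA)>
]>
<catalog>
  <part id="p-100">
    <name>Bracket</name>
    <dimensions><width>40</width><height>25</height></dimensions>
  </part>
</catalog>"""

def max_depth(element):
    """Count the deepest level of element nesting below (and including) element."""
    return 1 + max((max_depth(child) for child in element), default=0)

root = ET.fromstring(DOCUMENT)
part = root.find("part")

print(part.get("id"), part.findtext("name"))  # p-100 Bracket
print(max_depth(root))                        # 4
```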

XML has been designed for maximum expressive power, maximum teachability, and maximum ease of implementation. The language is not backward-compatible with existing HTML documents, but documents conforming to the W3C HTML 3.2 specification can easily be converted to XML, as can generic SGML documents and documents generated from databases.

 

Web applications of XML

 

The applications that will drive the acceptance of XML are those that cannot be accomplished within the limitations of HTML. These applications can be divided into four broad categories:

  1. Applications that require the Web client to mediate between two or more heterogeneous databases.
  2. Applications that attempt to distribute a significant proportion of the processing load from the Web server to the Web client.
  3. Applications that require the Web client to present different views of the same data to different users.
  4. Applications in which intelligent Web agents attempt to tailor information discovery to the needs of individual users.

The alternative to XML for these applications is proprietary code embedded as "script elements" in HTML documents and delivered in conjunction with proprietary browser plug-ins or Java applets. XML derives from a philosophy that data belongs to its creators and that content providers are best served by a data format that does not bind them to particular script languages, authoring tools, and delivery engines but provides a standardised, vendor-independent, level playing field upon which different authoring and delivery tools may freely compete.

Web agents: data that knows about me

 

A future domain for XML applications will arise when intelligent Web agents begin to make larger demands for structured data than can easily be conveyed by HTML. Perhaps the earliest applications in this category will be those in which user preferences must be represented in a standard way to mass media providers.

Consider a personalised TV guide for the fabled 500-channel cable TV system. A personalised TV guide that works across the entire spectrum of possible providers requires not only that the user's preferences and other characteristics (educational level, interest, profession, age, visual acuity) be specified in a standard, vendor-independent manner -- obviously a job for an industry-standard markup system -- but also that the programs themselves be described in a way that allows agents to intelligently select the ones most likely to be of interest to the user. This second requirement can be met only by a standardised system that uses many specialised tags to convey specific attributes of a particular program offering (subject category, audience category, leading actors, length, date made, critical rating, specialised content, language, etc.). Exactly the same requirements would apply to customised newspapers and many other applications in which information selection is tailored to the individual user.
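
A minimal sketch of such an agent follows, assuming that both the viewer's preferences and the program listings arrive as structured markup. No such industry vocabulary is defined here: the tag names (guide, program, subject, viewer, interest, max-length) and the select_programs function are hypothetical, and a real agent would weigh many more attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical program listings, one specialised tag per attribute.
GUIDE = """<guide>
  <program><title>Deep Sea Vents</title><subject>science</subject>
    <language>en</language><length unit="min">50</length></program>
  <program><title>Cup Final Replay</title><subject>sport</subject>
    <language>en</language><length unit="min">120</length></program>
  <program><title>The Silk Road</title><subject>history</subject>
    <language>en</language><length unit="min">90</length></program>
  <program><title>Roman Roads</title><subject>history</subject>
    <language>en</language><length unit="min">55</length></program>
</guide>"""

# Hypothetical viewer profile in the same vendor-independent style.
PREFERENCES = """<viewer>
  <interest>science</interest>
  <interest>history</interest>
  <language>en</language>
  <max-length unit="min">60</max-length>
</viewer>"""

def select_programs(guide_xml, prefs_xml):
    """Return the titles of programs that match the viewer's stated preferences."""
    guide = ET.fromstring(guide_xml)
    prefs = ET.fromstring(prefs_xml)
    interests = {e.text for e in prefs.findall("interest")}
    language = prefs.findtext("language")
    max_length = int(prefs.findtext("max-length"))
    return [
        p.findtext("title")
        for p in guide.findall("program")
        if p.findtext("subject") in interests
        and p.findtext("language") == language
        and int(p.findtext("length")) <= max_length
    ]

print(select_programs(GUIDE, PREFERENCES))  # ['Deep Sea Vents', 'Roman Roads']
```

Because the selection logic depends only on the agreed tag names, any provider's listings and any vendor's agent could interoperate in this way.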

While such applications still lie over the horizon, it is obvious that they will play an increasingly important role in our lives and that their implementation will require XML-like data in order to function interoperably and thereby allow intelligent Web agents to compete effectively in an open market.

 

Advanced linking and stylesheet mechanisms

 

Outside XML as such, but an integral part of the W3C SGML effort, are powerful linking and stylesheet mechanisms that go beyond current HTML-based methods just as XML goes beyond HTML.

Despite its name and all of the publicity that has surrounded HTML, this so-called "hypertext markup language" actually implements only a small fraction of the functionality that has historically been associated with hypertext systems. Only the simplest form of linking is supported -- unidirectional links to hardcoded locations. This is a far cry from the systems that were built and proven during the 1970s and 1980s.

In a true hypertext system of the kind envisioned for the XML effort, there will be standardised syntax for all of the classic hypertext linking mechanisms:

  • Location-independent naming
  • Bidirectional links
  • Links that can be specified and managed outside the documents to which they apply
  • N-ary hyperlinks (e.g., rings, multiple windows)
  • Aggregate links (multiple sources)
  • Transclusion (the link target document appears to be part of the link source document)
  • Attributes on links (link types)
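
Several of these mechanisms can be sketched together. The example below assumes a hypothetical "link base" kept apart from the documents it connects; the linkbase, link, and end tags and the build_index function are invented for illustration and are not the syntax of any W3C specification. Each link names its endpoints symbolically, carries a type, may have more than two ends, and can be followed from any end.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical link base stored outside the linked documents. Endpoints are
# symbolic names rather than hardcoded locations.
LINKBASE = """<linkbase>
  <link type="citation">
    <end name="paper-a"/>
    <end name="paper-b"/>
  </link>
  <link type="ring">
    <end name="ch1"/>
    <end name="ch2"/>
    <end name="ch3"/>
  </link>
</linkbase>"""

def build_index(linkbase_xml):
    """Map every endpoint name to the (link type, other endpoints) pairs it takes part in."""
    index = defaultdict(list)
    for link in ET.fromstring(linkbase_xml).findall("link"):
        ends = [end.get("name") for end in link.findall("end")]
        for end in ends:
            others = [other for other in ends if other != end]
            index[end].append((link.get("type"), others))
    return index

index = build_index(LINKBASE)
print(index["paper-b"])  # [('citation', ['paper-a'])]
print(index["ch2"])      # [('ring', ['ch1', 'ch3'])]
```

The citation link can be traversed from either paper, which a unidirectional HTML anchor embedded in one of them cannot express.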

 

Stylesheets

 

The current CSS (cascading style sheets) effort provides a style mechanism well suited to the relatively low-level demands of HTML but incapable of supporting the greatly expanded range of rendering techniques made possible by extensible structured markup. The counterpart to XML is a stylesheet programming language that is:

  • Freely extensible so that stylesheet designers can define an unlimited number of treatments for an unlimited variety of tags.
  • Turing-complete so that stylesheet designers can arbitrarily extend the available procedures.
  • Based on a standard syntax to minimise the learning curve.
  • Able to address the entire tree structure of an XML document in structural terms, so that context relationships between elements in a document can be expressed to any level of complexity.
  • Completely internationalised so that left-to-right, right-to-left, and top-to-bottom scripts can all be dealt with, even if mixed in a single document.
  • Provided with a sophisticated rendering model that allows the specification of professional page layout features such as multiple column sets, rotated text areas, and float zones.
  • Defined in a way that allows partial rendering in order to enable efficient delivery of documents over the Web.
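
The requirement to address the tree in structural terms can be illustrated with a small sketch in which the treatment of an element depends on its context. This is not stylesheet syntax of any kind: the RULES table, the style strings, and the styled_titles function are hypothetical, and only the immediate parent is consulted, where a real stylesheet language would allow context of any depth.

```python
import xml.etree.ElementTree as ET

DOC = """<book>
  <title>XML Primer</title>
  <chapter>
    <title>Background</title>
    <section><title>SGML</title></section>
  </chapter>
</book>"""

# Hypothetical rules keyed on (parent tag, element tag): the same "title"
# tag receives a different treatment depending on where it appears.
RULES = {
    ("book", "title"): "36pt bold",
    ("chapter", "title"): "24pt bold",
    ("section", "title"): "14pt italic",
}

def styled_titles(element, path=()):
    """Walk the tree, pairing each element's text with the style its context selects."""
    path = path + (element.tag,)
    results = []
    style = RULES.get(path[-2:])  # the last two tags on the path: parent, element
    if style is not None:
        results.append((element.text, style))
    for child in element:
        results.extend(styled_titles(child, path))
    return results

for text, style in styled_titles(ET.fromstring(DOC)):
    print(f"{text}: {style}")
```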

Such a language already exists in a new international standard called the Document Style Semantics and Specification Language (DSSSL, ISO/IEC 10179). Published in April 1996, DSSSL is the stylesheet language of the future for XML documents.

HTML functions well as a markup for the publication of simple documents and as a transportation envelope for downloadable scripts. However, the need to support the much greater information requirements of standardised Java applications will necessitate the development of a standard, extensible, structured language and similarly expanded linking and stylesheet mechanisms. The W3C SGML effort is actively developing a set of specifications that will allow these objectives to be met within an open standards environment.
