|
The Extensible Markup Language (XML)
is a general-purpose markup language. It is classified
as an extensible language because it allows its
users to define their own tags. Its primary purpose
is to facilitate the sharing of structured data
across different information systems, particularly
via the Internet. It is used both to encode documents
and serialize data. In the latter context, it
is comparable with other text-based serialization
languages such as JSON and YAML.
It started as a simplified subset
of the Standard Generalized Markup Language (SGML),
and is designed to be relatively human-legible.
By adding semantic constraints, application languages
can be implemented in XML. These include XHTML,
RSS, MathML, GraphML, Scalable Vector Graphics,
MusicXML, and thousands of others. Moreover, XML
is sometimes used as the specification language
for such application languages.
XML is recommended by the World Wide
Web Consortium. It is a fee-free open standard.
The W3C recommendation specifies both the lexical
grammar and the requirements for parsing.
Advantages of XML
It is text-based.
It supports Unicode, allowing almost any information
in any written human language to be communicated.
It can represent the most general computer science
data structures: records, lists and trees.
Its self-documenting format describes structure
and field names as well as specific values.
The strict syntax and parsing requirements make
the necessary parsing algorithms extremely simple,
efficient, and consistent.
XML is heavily used as a format for document storage
and processing, both online and offline.
It is based on international standards.
It can be updated incrementally.
It allows validation using schema languages such
as XSD and Schematron, which makes effective unit-testing,
firewalls, acceptance testing, contractual specification
and software construction easier.
The hierarchical structure is suitable for most
(but not all) types of documents.
It manifests as plain text files, which are less
restrictive than proprietary document formats.
It is platform-independent, thus relatively immune
to changes in technology.
Forward and backward compatibility are relatively
easy to maintain despite changes in DTD or Schema.
Its predecessor, SGML, has been in use since 1986,
so there is extensive experience and software
available.
An element fragment of a well-formed XML document
is also a well-formed XML document.
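
The last point above can be checked with a short sketch.
It assumes Python's standard xml.etree.ElementTree module;
the sample document and the helper name extract_fragment
are hypothetical, chosen only for illustration. The non-ASCII
titles also exercise the Unicode support mentioned earlier.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; element names are illustrative only.
DOC = (
    "<library>"
    "<book id='1'><title>Émile</title></book>"
    "<book id='2'><title>道德經</title></book>"
    "</library>"
)


def extract_fragment(document: str, path: str) -> str:
    """Serialize the first element matching `path` as a standalone string."""
    root = ET.fromstring(document)
    element = root.find(path)
    if element is None:
        raise ValueError(f"no element matches {path!r}")
    return ET.tostring(element, encoding="unicode")


fragment = extract_fragment(DOC, "book")

# Parsing the fragment on its own succeeds, so the element
# fragment is itself a well-formed XML document.
reparsed = ET.fromstring(fragment)
print(reparsed.tag, reparsed.get("id"), reparsed.find("title").text)
```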
Disadvantages of XML
XML syntax is redundant or large relative to binary
representations of similar data.
The redundancy may affect application efficiency
through higher storage, transmission and processing
costs.
XML syntax is verbose relative to alternative
text-based data transmission formats.
No intrinsic data type support: XML provides no
specific notion of "integer", "string", "boolean",
"date", and so on.
The hierarchical model for representation is limited
in comparison to the relational model or an
object-oriented graph.
Expressing overlapping (non-hierarchical) node
relationships requires extra effort.
XML namespaces are problematic to use and namespace
support can be difficult to correctly implement
in an XML parser.
XML is commonly depicted as "self-documenting"
but this depiction ignores critical ambiguities.
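
Two of the points above, verbosity relative to other
text-based formats and the lack of intrinsic data types,
can be illustrated with a small sketch. It assumes Python's
standard json and xml.etree.ElementTree modules; the record
and the helper to_xml are hypothetical, and the
one-element-per-field mapping is only one of many possible
encodings.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical record used only for this comparison.
record = {"id": 42, "active": True, "name": "widget"}


def to_xml(data: dict) -> str:
    """Encode a flat dict as one child element per field (an assumed mapping)."""
    root = ET.Element("record")
    for key, value in data.items():
        child = ET.SubElement(root, key)
        child.text = str(value)  # XML text content is always character data
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(record)
json_text = json.dumps(record)

# Every tag name appears twice in the XML, so the same record is longer.
print(len(xml_text), len(json_text))

# JSON restores the integer; XML hands back a string that the
# application must convert itself.
raw_id = ET.fromstring(xml_text).findtext("id")
typed_id = int(raw_id)
json_id = json.loads(json_text)["id"]
```

A schema language such as XSD can declare that a field
holds an integer, but that information lives outside the
document's own syntax.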
Processing XML files
Three traditional techniques for processing XML
files are:
Using a programming language and the SAX API.
Using a programming language and the DOM API.
Using a transformation engine and a filter.
More recent and emerging techniques for processing
XML files are:
Pull parsing
Data binding
Non-extractive XML Processing API
Simple API for XML (SAX)
SAX is a lexical, event-driven interface in which
a document is read serially and its contents are
reported as "callbacks" to various methods
on a handler object of the user's design. SAX
is fast and efficient to implement, but difficult
to use for extracting information at random from
the XML, since it tends to burden the application
author with keeping track of what part of the
document is being processed. It is better suited
to situations in which certain types of information
are always handled the same way, no matter where
they occur in the document.
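
The callback style described above can be sketched with
Python's standard xml.sax module. The handler class
TitleCollector and the sample document are hypothetical.
Note how the handler has to keep its own state (the
_in_title flag) to know which part of the document it is
in, which is the bookkeeping burden mentioned above.

```python
import xml.sax

# Hypothetical sample document.
DOC = (
    b"<library>"
    b"<book><title>A</title></book>"
    b"<book><title>B</title></book>"
    b"</library>"
)


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title>, wherever it occurs."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


handler = TitleCollector()
xml.sax.parseString(DOC, handler)  # the parser drives; the handler reacts
print(handler.titles)
```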
DOM
DOM is an interface-oriented Application Programming
Interface that allows for navigation of the entire
document as if it were a tree of "Node"
objects representing the document's contents.
A DOM document can be created by a parser, or
can be generated manually by users (with limitations).
Data types in DOM Nodes are abstract; implementations
provide their own programming language-specific
bindings. DOM implementations tend to be memory
intensive, as they generally require the entire
document to be loaded into memory and constructed
as a tree of objects before access is allowed.
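
A minimal sketch of tree navigation and manual construction,
assuming Python's standard xml.dom.minidom module as the
DOM implementation; the sample document is hypothetical.
The whole document is parsed into memory before any node
is accessed.

```python
from xml.dom import minidom

# Hypothetical sample document.
DOC = (
    "<library>"
    "<book id='1'><title>A</title></book>"
    "<book id='2'><title>B</title></book>"
    "</library>"
)

dom = minidom.parseString(DOC)  # builds the full tree of Node objects

# Random access: jump straight to the second <book>.
second = dom.getElementsByTagName("book")[1]
title_text = second.getElementsByTagName("title")[0].firstChild.data

# Navigation works in every direction, for example upward.
parent_name = second.parentNode.tagName

# Nodes can also be created by hand and attached to the tree.
new_book = dom.createElement("book")
new_book.setAttribute("id", "3")
new_title = dom.createElement("title")
new_title.appendChild(dom.createTextNode("C"))
new_book.appendChild(new_title)
dom.documentElement.appendChild(new_book)

print(title_text, parent_name, len(dom.getElementsByTagName("book")))
```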
Transformation engines and filters
A filter in the Extensible Stylesheet Language
(XSL) family can transform an XML file for displaying
or printing.
XSL-FO is a declarative, XML-based page
layout language. An XSL-FO processor can be used
to convert an XSL-FO document into another non-XML
format, such as PDF.
XSLT is a declarative, XML-based document
transformation language. An XSLT processor can
use an XSLT stylesheet as a guide for the conversion
of the data tree represented by one XML document
into another tree that can then be serialized
as XML, HTML, plain text, or any other format
supported by the processor.
XQuery is a W3C language for querying,
constructing and transforming XML data.
XPath is a DOM-like node tree data model
and path expression language for selecting data
within XML documents. XSL-FO, XSLT and XQuery
all make use of XPath. XPath also includes a useful
function library.
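
Path expressions of the kind XPath defines can be tried
with Python's standard xml.etree.ElementTree module. This
is a sketch under a stated limitation: ElementTree supports
only a subset of XPath syntax and does not provide the
XPath function library, and XSLT, XSL-FO, and XQuery
processors are not part of the Python standard library.
The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<library>
  <book lang="en"><title>A</title><year>1999</year></book>
  <book lang="fr"><title>B</title><year>2005</year></book>
  <book lang="en"><title>C</title><year>2010</year></book>
</library>"""

root = ET.fromstring(DOC)

# Attribute predicate: titles of all English-language books.
english_titles = [t.text for t in root.findall("./book[@lang='en']/title")]

# Positional predicate: XPath positions are 1-based.
second_title = root.findtext("./book[2]/title")

# Child-text predicate: books whose <year> child is exactly "1999".
from_1999 = [b.findtext("title") for b in root.findall(".//book[year='1999']")]

print(english_titles, second_title, from_1999)
```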
Pull Parsing
A form of XML access that has become increasingly
popular in recent years is pull parsing, which
treats the document as a series of items that
are read in sequence, each one requested by the
application when it is ready for it. (In a push
interface such as SAX, by contrast, the parser
drives and calls back into the application.) This
allows for writing recursive-descent parsers in
which the structure of the parsing code mirrors
the structure of the XML being parsed. Intermediate
parsed results can be held in local variables
within the methods performing the parsing, passed
down (as method parameters) into lower-level methods,
or returned (as method return values) to higher-level
methods.

For instance, in the Java programming language,
the StAX framework can be used to create what
is essentially an iterator that sequentially
visits the elements, attributes, and data in an
XML document. Code that uses this iterator can
test the current item (to tell, for example,
whether it is a start element, an end element,
or text), inspect its properties (local name,
namespace, values of XML attributes, value of
text, etc.), and request that the iterator be
moved to the next item. The code can thus extract
information from the document as it traverses
it.

One significant advantage of pull-parsing methods
is that they are typically far more speed- and
memory-efficient than DOM-style parsing, and at
least comparable to SAX. Another advantage is
that the recursive-descent approach lends itself
to keeping data as typed local variables in the
code doing the parsing, whereas SAX typically
requires the application to maintain intermediate
data manually, in a stack of the parent elements
of the element being parsed. Pull-parsing code
is therefore often more straightforward to understand
and maintain than SAX parsing code.

Some potential disadvantages of pull parsing are
that it is a newer approach that is not as well
known among XML programmers (although recursive
descent is a very common technique for writing
compilers and interpreters for languages other
than XML), and that most existing pull parsers
cannot yet perform advanced processing such as
XML schema validation as they parse a document.
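
The iterator-style traversal described above (commonly
called pull parsing) can be sketched in Python. StAX itself
is a Java API, so this sketch substitutes the standard
xml.etree.ElementTree.iterparse function as the event
iterator; that substitution, the sample document, and the
function names parse_library and parse_book are assumptions
made for illustration. Each function consumes exactly the
events belonging to one element, so the code mirrors the
document structure and keeps results in typed local
variables.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = (
    b"<library>"
    b"<book id='1'><title>A</title><year>1999</year></book>"
    b"<book id='2'><title>B</title><year>2005</year></book>"
    b"</library>"
)


def parse_library(events):
    """Consume the <library> element and return a list of book dicts."""
    event, elem = next(events)
    if event != "start" or elem.tag != "library":
        raise ValueError("expected <library> as the root element")
    books = []
    for event, elem in events:
        if event == "start" and elem.tag == "book":
            books.append(parse_book(events, elem))  # descend one level
        elif event == "end" and elem.tag == "library":
            break
    return books


def parse_book(events, book_elem):
    """Consume events up to </book>; the caller already saw the start tag."""
    book = {"id": int(book_elem.get("id"))}
    for event, elem in events:
        # Element text is only guaranteed to be complete at the "end" event.
        if event == "end" and elem.tag == "title":
            book["title"] = elem.text
        elif event == "end" and elem.tag == "year":
            book["year"] = int(elem.text)
        elif event == "end" and elem.tag == "book":
            elem.clear()  # drop children that are no longer needed
            break
    return book


events = iter(ET.iterparse(io.BytesIO(DOC), events=("start", "end")))
books = parse_library(events)
print(books)
```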
Data binding
Another form of XML Processing API is data binding,
where XML data is made available as a custom,
strongly typed programming language data structure,
in contrast to the interface-oriented DOM. Example
data binding systems are the Java Architecture
for XML Binding (JAXB) and the Strathclyde Novel
Architecture for Querying XML (SNAQue).
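
The idea of data binding can be sketched in Python with
a dataclass. This is not how JAXB or SNAQue work internally;
it is a hand-written mapping, and the Book class, its
fields, and the load_books helper are hypothetical. The
point is that application code then works with typed
attributes such as book.year rather than with generic
nodes.

```python
from dataclasses import dataclass
from typing import List
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = (
    "<library>"
    "<book id='1'><title>A</title><year>1999</year></book>"
    "<book id='2'><title>B</title><year>2005</year></book>"
    "</library>"
)


@dataclass
class Book:
    """Strongly typed view of one <book> element (assumed schema)."""
    id: int
    title: str
    year: int

    @classmethod
    def from_element(cls, elem: ET.Element) -> "Book":
        return cls(
            id=int(elem.get("id")),
            title=elem.findtext("title"),
            year=int(elem.findtext("year")),
        )


def load_books(document: str) -> List[Book]:
    """Bind every <book> child of the root element to a Book instance."""
    root = ET.fromstring(document)
    return [Book.from_element(e) for e in root.findall("book")]


books = load_books(DOC)
newest = max(books, key=lambda b: b.year)  # plain attribute access, real ints
print(newest.title)
```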
Non-extractive XML Processing API
Non-extractive XML Processing API is a new and
emerging category of parsers. The most representative
is VTD-XML, which dispenses with the object-oriented
modeling of the XML hierarchy and instead uses 64-bit
Virtual Token Descriptors (encoding offsets, lengths,
depths, and types) of XML tokens. VTD-XML's approach
enables a number of interesting features and enhancements,
such as high performance, low memory usage, ASIC
implementation, incremental update, and native
XML indexing.
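
The idea of describing a token by a packed integer instead
of an extracted string object can be sketched as follows.
The bit layout used here (32-bit offset, 20-bit length,
8-bit depth, 4-bit type) is an assumption chosen so that
the fields fill 64 bits; it is not claimed to be VTD-XML's
actual layout, and the names pack_token, unpack_token, and
TYPE_ELEMENT_NAME are hypothetical.

```python
# Assumed field widths, for illustration only; they sum to 64 bits.
OFFSET_BITS, LENGTH_BITS, DEPTH_BITS, TYPE_BITS = 32, 20, 8, 4

TYPE_ELEMENT_NAME = 0  # hypothetical token type code


def pack_token(offset: int, length: int, depth: int, token_type: int) -> int:
    """Pack the four fields into a single 64-bit integer descriptor."""
    fields = (
        (offset, OFFSET_BITS),
        (length, LENGTH_BITS),
        (depth, DEPTH_BITS),
        (token_type, TYPE_BITS),
    )
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"value {value} does not fit in {bits} bits")
    return (token_type << 60) | (depth << 52) | (length << 32) | offset


def unpack_token(descriptor: int):
    """Return (offset, length, depth, token_type) from a descriptor."""
    return (
        descriptor & 0xFFFFFFFF,
        (descriptor >> 32) & 0xFFFFF,
        (descriptor >> 52) & 0xFF,
        (descriptor >> 60) & 0xF,
    )


# The document stays in its original byte buffer; nothing is copied out.
DOC = b"<book><title>A</title></book>"
start = DOC.index(b"title")
descriptor = pack_token(start, len(b"title"), 1, TYPE_ELEMENT_NAME)

# The token text is recovered on demand by slicing the buffer.
offset, length, depth, token_type = unpack_token(descriptor)
print(DOC[offset:offset + length], depth)
```

Because a descriptor only points into the original buffer,
a parser built this way can avoid allocating one object
per node, which is consistent with the low memory usage
noted above.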
|