4Suite Core: Open-source Library for XML Processing
Users' Manual

This version:
Revision 0.8 (2006-08-15)
Contributors:
Mike J. Brown
John L. Clark
Uche G. Ogbuji
Luis Miguel Morillas
Dave Pawson
Uche Ogbuji

Abstract

This document describes how to perform a set of XML manipulation tasks with the 4Suite XML processing library. These tasks include parsing XML using either DOM-like or SAX-like models, querying XML or XML models using XPath, using XSLT, using XUpdate, and validating documents with RELAX NG.


Table Of Contents

1 Introduction

2 Installation

3 DOM-like XML processing

3.1 Parsing XML documents

3.1.1 Quick access to the Domlette reader API

3.1.2 The full Domlette reader API

3.1.3 The importance of base URIs

3.1.4 Parsing XML that's already a Unicode string

3.1.5 NonvalidatingReader

3.1.6 EntityReader Examples

3.1.7 ValidatingReader

3.1.8 NoExtDtdReader

3.1.9 Creating your own reader instance

3.1.10 InputSource objects

3.1.11 Converting from other DOM libraries

3.2 Domlette API summary

3.2.1 What about getElementsByTagName()?

3.3 Serializing Domlette nodes

3.4 Building a DOM from scratch

3.5 XPath query

3.6 More on base URIs

3.7 Why does Domlette diverge from the DOM specification?

4 SAX

4.1 Validating a document while parsing it using SAX

4.2 Walking a DOM to fire SAX events

4.3 Building a Domlette from SAX events

4.4 Feeding a generator from SAX events

4.5 SAX filters

4.6 Streaming canonicalization

5 XPath queries

5.1 The quickest option

5.2 Type mappings

5.3 Advanced use

5.4 Reusing parsed XPath queries

5.5 Migration from PyXML's XPath

6 XSLT processing

6.1 The super-simple XSLT API

6.2 Full XSLT processing API

6.3 Example

6.4 Using Domlette objects instead of InputSources

6.5 Top-level parameters

6.6 Using xml-stylesheet processing instructions

6.7 Alternative output destinations

6.8 Transform chaining

6.9 XSLT patterns

7 XPath and XSLT extensions

7.1 Extension functions (XPath and XSLT)

7.2 Extension elements (XSLT)

7.3 Extension element API

7.3.1 Controlling output from XSLT extensions

7.3.2 Creating result tree fragments

7.3.3 Communicating with the external code that invokes XSLT

8 Streaming XML output

8.1 Starting with MarkupWriter

8.2 How to insert elements

8.3 How to insert attributes

8.4 How to insert text nodes

8.5 How to insert a complete chunk

8.6 How to insert processing instructions and comments

8.7 Using namespaces

8.8 Setting up the output

8.9 More examples

8.9.1 Writing XHTML with MarkupWriter

8.9.2 Writing information of directory listing as an XML document

8.9.3 Building a bot

9 Validation using RELAX NG

10 XUpdate processing

10.1 XUpdate and namespaces

11 XInclude processing

11.1 About XInclude

11.2 XInclude support in 4Suite

11.3 Examples

12 XPointer processing

12.1 About XPointer

12.2 XPointer support in 4Suite

12.3 Examples

13 Comprehensive examples

13.1 Transforming DocBook using the DocBook XSL stylesheets

14 Resources


1 Introduction

4Suite allows users to take advantage of standard XML technologies rapidly and to develop and integrate Web-based applications. It also puts practical technologies for knowledge management projects in the hands of developers. It is implemented in Python with C extensions.

At the core of 4Suite is a library of integrated tools (including convenient command-line tools) for XML processing, implementing open technologies such as DOM, SAX, XSLT, XInclude, XPointer, XLink, XPath, XUpdate, RELAX NG, and XML/SGML Catalogs.

With 4Suite, you can carry out the tasks covered in this manual, and much more.

2 Installation

Please see the UNIX or Windows install documents. Remember that if you are using Cygwin on Windows, you should follow the UNIX instructions.

3 DOM-like XML processing

Domlette is 4Suite's lightweight DOM implementation. It is optimized for XPath operations, speed, and relatively low memory overhead. The Domlette API is accessible through Ft.Xml.Domlette. This section describes how to parse, manipulate, and then serialize XML documents using this API.

Below, we briefly summarize the various elements of the API that form the basic life span of Domlette objects.

Parsing XML documents

The Ft.Xml module contains the function Parse that gets the job done quickly. See “Quick access to the Domlette reader API” for details. For a bit more advanced parsing, you will need a combination of the reader instances in the Ft.Xml.Domlette module and Ft.Xml.CreateInputSource for constructing InputSource instances. In rare cases you might need lower-level APIs in the Ft.Xml.InputSource module. Read “The full Domlette reader API” if Ft.Xml.Parse isn't enough.

Modifying and interacting with XML documents

The Domlette API for interacting with XML documents—accessible as methods of the various Domlette objects—is similar to the DOM Level 2 specification. See “Domlette API summary” for more information.

Serializing XML documents

The Ft.Xml.Domlette module provides two functions, Print and PrettyPrint, for writing your XML documents. The Print function writes the XML document precisely as given in the model. On the other hand, the PrettyPrint function adds whitespace nodes to your document to try to indent the resulting output nicely. See “Serializing Domlette nodes” for details.

3.1 Parsing XML documents

We begin our discussion of the Domlette API by describing how to obtain a model of your XML documents to manipulate further. Because XML documents offer such rich functionality and exist in such varied environments, there can be a surprising amount of work that you must do to simply load your XML documents. We begin by providing a short-cut for easy access. We will then dive into the full suite of document loading utilities.

3.1.1 Quick access to the Domlette reader API

For basic document manipulations or to get started quickly, the Ft.Xml module offers a quick way to parse XML documents and directly obtain access to the Domlette interface to those documents. Within this module the function of interest is Parse.

Warning

This function will get you started quickly because it specifically chooses some default values for some of the more advanced parsing features. If you are passing in a string or stream, and the material in “The importance of base URIs” applies to your parsing situation, then you will want to use the full-featured API. In brief, if your XML document references external resources, you should not use this convenience function. See “The full Domlette reader API” instead.

This function returns a Domlette Document representing the root of the document from the argument.

Parse(source)

The Parse function takes a single argument, which is a byte string (not unicode object), file-like object (stream), file path or URI.

XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""

from Ft.Xml import Parse

doc = Parse(XML)
# If the above XML document were located in the file
# "target.xml", we could have used `Parse("target.xml")`.
print doc.xpath('string(ham//em[1])')

3.1.2 The full Domlette reader API

You create Domlette instances by parsing XML documents with the reader system. For general use, the Ft.Xml.Domlette package contains instances of the different reader classes that can be used directly after you import them. These instances include NonvalidatingReader and ValidatingReader, which provide non-validating parsing and validating parsing services, respectively. The validation in this case refers to DTD validation. For RELAX NG validation, see “Validation using RELAX NG”. All the reader classes (and, hence, their bundled instances) are described in later sections. After you have obtained one of these reader instances, you feed your XML document entity's byte stream to the reader. We summarize the available reader methods below.

parseUri(uri)

The parseUri method takes a single argument; this uri argument is the absolute URI of the document entity to parse. The URI will be dereferenced by the default resolver.

parseString(st, uri)

The parseString method takes two arguments; st is the XML document entity in the form of an encoded Python string (not a Unicode string). See the next section for details on the uri argument.

parseStream(stream, uri)

The parseStream method takes two arguments; stream is a Python file-like object that can supply the document entity's bytes via read() calls. See the next section for details on the uri argument.

parse(inputSource)

The parse method takes a single argument; inputSource is an Ft.Xml.InputSource.InputSource object, described in “InputSource objects”.
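The difference between the string-style and stream-style entry points can be sketched without 4Suite installed. The following is only an illustration that uses Python's bundled xml.dom.minidom as a stand-in (an assumption, not the Domlette reader API); unlike the reader methods above, minidom takes no uri argument. It is written for a current Python 3 interpreter.

```python
import io
from xml.dom import minidom

# An encoded byte string, not Unicode text.
data = b"<spam>eggs</spam>"

# String-style entry point: hand the bytes over directly.
doc_from_string = minidom.parseString(data)

# Stream-style entry point: any file-like object whose read() yields bytes.
stream = io.BytesIO(data)
doc_from_stream = minidom.parse(stream)

root_names = (doc_from_string.documentElement.tagName,
              doc_from_stream.documentElement.tagName)
print(root_names)
```

Both calls build the same tree; the choice only depends on whether the document entity is already in memory or is being read from a file, socket, or similar source.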

The next two sections cover some of the issues that you should understand before using these methods. Examples begin in “NonvalidatingReader”.

3.1.3 The importance of base URIs

In the first three methods listed in the previous section, the uri argument is the URI of the document entity that you are feeding to the parser. This base URI is a very important, but often overlooked, concept in document processing.

The URI gives the document entity a unique identifier that can be used to refer to the document as a whole. Also, each Domlette node derived from a particular entity inherits that entity's URI as the node's baseURI property, unless an alternative base URI was indicated, such as with xml:base, or if part of the document was loaded as an external entity or XInclude.

The document's URI is also used as the "base URI" for resolving any relative URI references that may appear within the document itself. Relative URI references may occur in a document in places like:

  • <!DOCTYPE> or <!ENTITY>, immediately following the keyword SYSTEM

  • <xsl:import> and <xsl:include>, in the value of the href attribute

  • <xi:include>, in the value of the href attribute

  • <exsl:document>, in the value of the href attribute

  • the arguments to XSLT's document() function

It is a common misconception that relative URI references in a document's content are considered to be relative to the processor's current working directory. They are actually resolved relative to the URI of the document that contains the relative URI reference (more specifically, relative to the URI of the entity in which the reference occurs, keeping in mind that a document may be comprised of multiple entities, i.e., separate files).

In all cases, the document URI that you supply in the reader API must be "absolute", which means that it has a scheme, e.g. "http://spam/eggs.xml", not just "/spam/eggs.xml" or "eggs.xml".
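To make the resolution rule concrete, here is a small sketch using the standard library's urllib.parse (Python 3) rather than 4Suite's own Ft.Lib.Uri module. The helper name is_absolute is hypothetical and simply checks for a scheme, which is the criterion described above.

```python
from urllib.parse import urljoin, urlsplit

def is_absolute(uri):
    """Return True if the URI carries a scheme such as http: or file:."""
    return bool(urlsplit(uri).scheme)

# The document (base) URI that would be supplied to the reader.
base = "http://foo/test/spam.xml"

# Relative references are resolved against the base URI,
# not against the current working directory.
resolved = urljoin(base, "eggs.xml")
parent = urljoin(base, "../common/eggs.xml")

checks = [is_absolute(u)
          for u in ("http://spam/eggs.xml", "/spam/eggs.xml", "eggs.xml")]
print(resolved, parent, checks)
```

Only the first of the three candidate URIs has a scheme, so only it would qualify as a document URI for the reader API.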

If you know there are not going to be any relative URI references to resolve during initial parsing or during processing of the Domlette by other tools, then you can safely omit the argument, or, preferably, supply a dummy URI like "urn:dummy" or "http://spam/eggs.xml". If you choose to omit URI arguments from APIs that need them, you may get a Python warning, and a random URI—which is probably not what you want—will be assigned.

If you've understood all this and yet you want to just go ahead and not specify a base URI, you may have to turn off the likely warnings. You can do so with code such as in the following example.

import warnings
import Ft.Xml.Domlette

def disable_warnings(*args):
    pass

# Ignore every warning category and silence the display hook as well.
warnings.filterwarnings("ignore", category=Warning)
warnings.showwarning = disable_warnings

XML = "<spam/>"
doc = Ft.Xml.Domlette.NonvalidatingReader.parseString(XML)
Ft.Xml.Domlette.Print(doc)

You can also in such a case use the convenience function Ft.Xml.Parse (see above).
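The example above silences warnings for the whole process. If that is broader than you would like, the standard warnings module can limit the suppression to a single block. The sketch below assumes nothing about 4Suite: parse_without_base_uri is a hypothetical stand-in that merely emits a warning the way a reader call without a base URI might, and the UserWarning category is likewise an assumption.

```python
import warnings

def parse_without_base_uri(text):
    """Hypothetical stand-in for a reader call made without a base URI."""
    warnings.warn("no base URI supplied; one will be generated", UserWarning)
    return text

# Suppress warnings only while this block runs.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    quiet_result = parse_without_base_uri("<spam/>")

# Outside that block the warning is raised again; here it is recorded
# instead of printed so that it can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parse_without_base_uri("<spam/>")

print(len(caught))
```

The context manager restores the previous filter state on exit, so other parts of the program keep their normal warning behavior.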

3.1.4 Parsing XML that's already a Unicode string

Because 4Suite is trying to provide as thin a wrapper as possible to the underlying parser, and due to complexities in the APIs of these parsers, there is no API in 4Suite for parsing Python's Unicode strings.

If your XML is in the form of a Unicode string, you must encode the string as bytes so that the underlying parser can read it. Once you have an encoded string, you can pass it to the reader's parseString(), or wrap it in an InputSource using Ft.Xml.CreateInputSource, or the fromString() method of an InputSourceFactory. If the string is not UTF-16 or UTF-8 encoded, then you must tell the reader what encoding it actually uses. You can do this either by writing or replacing the XML declaration in the string itself, or (much easier) setting the optional encoding keyword argument in the reader's parseString() method or the InputSourceFactory's fromString() method. For an example, see the Akara article on external encoding declarations.
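The encoding step itself is ordinary Python. As a rough sketch, with xml.dom.minidom standing in for the 4Suite reader (an assumption made so that the example runs without 4Suite, on Python 3), a Unicode string is first encoded as UTF-8 bytes and only then handed to the parser:

```python
from xml.dom import minidom

# XML held as a Unicode string, including a non-ASCII character.
text = u"<spam>\u00e9ggs</spam>"

# Encode to bytes. With no XML declaration present, a parser
# assumes UTF-8, so that is the encoding to choose here.
data = text.encode("utf-8")

doc = minidom.parseString(data)
content = doc.documentElement.firstChild.data
print(repr(content))
```

If the bytes were produced in another encoding, such as ISO-8859-1, the parser would need to be told so, either through the XML declaration or through an external encoding argument as described above.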

3.1.5 NonvalidatingReader

Use NonvalidatingReader for basic parsing. NonvalidatingReader performs its parsing without validating against a DTD.

The following example will parse an XML source taken from the supplied URI, which is treated as a URL by the default resolver.

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri(
  "http://www.w3.org/2000/08/w3c-synd/home.rss")

The following example also parses an XML source taken from the supplied URI, which is treated as a URL. In this case, the default resolver tries to read the XML source from the filesystem.

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseUri("file:///tmp/spam.xml")

The following example parses XML from the filesystem. When given a relative file path in the local OS's format, we must first convert that path to a URI that our reader objects can use.

from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Lib import Uri
file_uri = Uri.OsPathToUri('spam.xml')
doc = NonvalidatingReader.parseUri(file_uri)

The following example parses XML from a string. Note that it does not provide a document/base URI.

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs</spam>")

In the following example, we are parsing XML from a string in a case where the document does need a base URI to be specified.

from Ft.Xml.Domlette import NonvalidatingReader
s = """<!DOCTYPE spam [ <!ENTITY eggs "eggs.xml"> ]>
<spam>&eggs;</spam>"""
doc = NonvalidatingReader.parseString(s, 'http://foo/test/spam.xml')
# during parsing, the replacement text for &eggs;
# will be obtained from http://foo/test/eggs.xml

In all of the above examples, doc is now a Domlette node object. 4Suite currently offers one Domlette implementation, written in C, called cDomlette.

3.1.6 EntityReader Examples

Sometimes you need to parse a fragment of XML rather than the full document. If operating in non-validating mode is sufficient, Domlette has a reader that can handle this case. When parsing such a fragment, EntityReader returns a Domlette document fragment rather than a document object.

from Ft.Xml.Domlette import EntityReader
s = """
<spam1>eggs</spam1>
<spam2>more eggs</spam2>
"""
docfrag = EntityReader.parseString(s, 'http://foo/test/spam.xml')

Note

The content parsed by EntityReader must be an XML External Parsed Entity. This means that it can't be just any XML document. The main limitation is that it must not have a document type declaration.

3.1.7 ValidatingReader

If you want to validate a document with a DTD as you parse it, use the ValidatingReader object instead. If ValidatingReader discovers that the document that it is currently parsing is invalid, then it raises an Ft.Xml.ReaderException and does not finish parsing the document. The following example illustrates these concepts.

# ValidatingReader is a global instance
from Ft.Xml.Domlette import ValidatingReader

XML = """<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/></a>"""

doc = ValidatingReader.parseString(XML, "urn:x-example:valid-a")
# And of course, as with other readers, you can use `parse`, `parseUri`, and
# `parseStream` as well.

# The following document, however, is invalid because an `a` element must
# have exactly two `b` children according to its DTD.
XML = """<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>"""

# This raises an `Ft.Xml.ReaderException` when it encounters invalid structure,
# and does not finish parsing the document into `doc`.
doc = ValidatingReader.parseString(XML, "urn:x-example:invalid-a")

3.1.8 NoExtDtdReader

When using NonvalidatingReader to parse a document, that document's DTD is still opened and read to obtain information such as entity declarations and default attribute values. You cannot suppress reading of the internal DTD subset, but you can prevent the external subset from being accessed by using NoExtDtdReader. This won't affect the processing of external parameter entities defined in the internal DTD subset. Use this object as you would use NonvalidatingReader.

3.1.9 Creating your own reader instance

In some cases you might not want to use the global reader instances. For instance in multithreaded use, you might want a reader per thread. Or you might want to change some of the parameters on the readers. If so, you can create your own reader instance:

from Ft.Xml.Domlette import NonvalidatingReaderBase
reader = NonvalidatingReaderBase()
doc = reader.parseUri("http://xmlhack.com/read.php?item=1560")

Instead of NonvalidatingReaderBase, you could use NoExtDtdReaderBase or ValidatingReaderBase, depending on your needs. Each of these three readers takes an optional inputSourceFactory constructor argument, which you can use to supply a custom URI resolver.

3.1.10 InputSource objects

All of the previous examples involve parsing URIs or strings of data. You can also handle InputSource objects. An InputSource is an object that encapsulates a source of encoded text for parsing, and a URI resolver. The advantage to using an InputSource is that it provides a standard API to the text stream, and—perhaps more importantly—allows you to associate a custom URI resolver with the stream.

Normally, you can just get an InputSource by calling the convenience function Ft.Xml.CreateInputSource with a single argument, which is a string (not Unicode object), file-like object (stream), file path or URI. You can then pass the InputSource object to the reader's parse() method, as in the following example.

from Ft.Xml import InputSource, CreateInputSource
from Ft.Xml.Domlette import NonvalidatingReader

#
# Use CreateInputSource to parse a URL:
#
isrc = CreateInputSource("http://xmlhack.com/read.php?item=1560")
doc1 = NonvalidatingReader.parse(isrc)
#
# Or a string:
#
isrc = CreateInputSource("<spam>eggs</spam>", "http://spam.com/base")
doc2 = NonvalidatingReader.parse(isrc)
#
# InputSource is a file-like object, so you can treat it as such:
#
isrc = CreateInputSource("http://xmlhack.com/read.php?item=1560")
raw_text = isrc.read()
#
# The uri/system ID you used for it is maintained
#
print isrc.uri
#
# You can also create other InputSources from URIs relative to this one
#
isrc2 = isrc.resolve("read.php?item=1703")

If you need lower-level control you can use an InputSourceFactory instance, calling the appropriate method: fromUri(uri), fromString(st), or fromStream(stream), much like the reader API described earlier. The following listing is functionally equivalent to the above one.

from Ft.Xml import InputSource
from Ft.Xml.Domlette import NonvalidatingReader

factory = InputSource.DefaultFactory
isrc = factory.fromUri("http://xmlhack.com/read.php?item=1560")
doc1 = NonvalidatingReader.parse(isrc)
#
# The factory is reusable. Here we also parse a string:
#
isrc = factory.fromString("<spam>eggs</spam>", "http://spam.com/base")
doc2 = NonvalidatingReader.parse(isrc)
#
# InputSource is a file-like object, so you can treat it as such:
#
isrc = factory.fromUri("http://xmlhack.com/read.php?item=1560")
raw_text = isrc.read()
#
# The uri/system ID you used for it is maintained
#
print isrc.uri
#
# You can also create other InputSources from URIs relative to this one
#
isrc2 = isrc.resolve("read.php?item=1703")

3.1.11 Converting from other DOM libraries

You can convert another Python DOM object (e.g. 4DOM or minidom) to a Domlette object using the function ConvertDocument:

from Ft.Xml.Domlette import ConvertDocument
converted_document = ConvertDocument(oldDocument, documentURI=u'http://www.example.org/')

The documentURI parameter provides a base URI for the converted nodes. If not specified, attributes documentURI and then baseURI are checked in the source DOM, as defined in DOM Level 3. If no URI is found in this way, a warning is issued and a UUID URI is generated for the new Domlette object.

3.2 Domlette API summary

Interacting with Domlette documents

You will use a large part of the Domlette API to interact with the model of your XML documents. The implementation of this part of the API is found in the Ft.Xml.cDomlette module. This part of the API allows you to navigate around a document and modify the content of that document. It is very similar to the DOM Level 2 specification and follows some of the DOM Level 3 specification; feel free to refer to those specifications and the 4Suite API documentation for details about the intended behavior of this API. You can find brief descriptions of the methods and attributes provided by this API listed below. This API is also nearly the same as the API for xml.dom, which is bundled with Python. The node type constants are inherited directly from xml.dom.Node.

Many objects that you will work with in the Domlette API are descendants of the Domlette Node class. Documents, document fragments (of class DocumentFragment), Elements, attributes (class Attr), text (class Text), processing instructions (class ProcessingInstruction), and comments (class Comment) are all nodes; any node operations are defined on objects of these types, as well. Some operations do not make sense on some objects, however. For example, it does not make sense to add children to an attribute node.

In the DOM model of XML documents, there is a Document node which represents the starting point for the other pieces of the document. This node is not the root element of the document; rather, the Document node contains the root element as its only element child. The Document node may have other children, though, such as processing instructions and comments.
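The distinction between the Document node and the root element is easy to see in code. The following sketch uses Python's bundled xml.dom.minidom as a stand-in for Domlette (an assumption made so that it runs without 4Suite, on Python 3); the document content is invented for illustration.

```python
from xml.dom import Node, minidom

doc = minidom.parseString(b"<?render fast?><!--note--><foo a='1'/>")

# The Document node holds every top-level node, not just the root element.
kinds = [child.nodeType for child in doc.childNodes]

# The root element is the Document's only element child.
root = doc.documentElement
print(kinds, root.nodeName)
```

Here the Document has three children (a processing instruction, a comment, and the element foo), and the parent of foo is the Document itself.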

You can easily access properties of a node directly. The following properties are available on any node. These properties generally store information about the structure of the document in the near "vicinity" of the target node.

Properties available on every Node object

attributes

This is a Python dictionary containing the attributes defined on the target node. The key for the dictionary is a tuple containing the namespace and local name of the attribute. The value associated with this attribute name tuple is the attribute (of class Attr) itself.

node = Parse("<foo a='1'/>")
print node.childNodes[0].attributes
{(None, u'a'): <Attr at 0x40870ecc: name u'a', value u'1'>}

baseURI

This is the base URI in scope for the target node as a Python unicode string. It is read-only and is computed dynamically according to DOM L3 Core.

childNodes

This is the Python list of all the node children of the target node. Note that in DOM terminology, the attributes of a node are not children of that node.

node = Parse("<foo a='1'/>")
print node.childNodes
[<Element at 0x4086052c: name u'foo', 1 attributes, 0 children>]

firstChild

This is the first child node of the target node. For a node that has at least one child, this is equivalent to childNodes[0], and is a useful property for quickly walking the document tree.

node = Parse("<foo a='1'/>")
print node.firstChild
<Element at 0x40860a6c: name u'foo', 1 attributes, 0 children>

lastChild

This is the last child node of the target node. This is equivalent to childNodes[-1].

node = Parse("<foo a='1'/><!--Hi!-->")
print node.lastChild
<Comment at 0x4087caf4: u'Hi!'>

localName

This is the local name of the target node as a Python unicode string.

namespaceURI

This is the namespace URI of the target node as a Python unicode string.

nextSibling

This is the node immediately following the target node, or None if the target node is the last child of its parent (or if the target node is an attribute, as attributes are unordered).

nodeValue

This is the value of the target node as a Python unicode string, if the target node has a string value. If not, this is None. To illustrate some of the possibilities, attributes and text nodes have values, while elements and documents do not.

ownerDocument

This is the Document node in which the target node is contained.

parentNode

This is the parent of the target node. If the target node is a Document node, then this will be None; Document nodes do not have parents.

prefix

This is the namespace prefix of the current node, or None if the current node does not (or cannot) have a namespace prefix.

previousSibling

This is the node immediately preceding the target node, or None if the target node is the first child of its parent (or if the target node is an attribute, as attributes are unordered).

rootNode

This is a synonym for ownerDocument.

xmlBase

This is a synonym for baseURI.
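Several of the navigation and naming properties above can be tried out with the following sketch. It relies on xml.dom.minidom from the Python standard library, whose node API is close to the one described here, rather than on Domlette itself, so treat it as an approximation; Domlette-specific properties such as rootNode and xmlBase are not shown. The namespace URI is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    b"<a xmlns:x='urn:x-example:ns'><x:b>one</x:b><c/></a>")
root = doc.documentElement
first, last = root.firstChild, root.lastChild

# Naming properties of the namespaced element <x:b>.
names = (first.prefix, first.localName, first.namespaceURI)

# Sibling and parent links describe the node's immediate surroundings.
siblings_linked = (first.nextSibling is last
                   and last.previousSibling is first)

# Text nodes carry a value; elements do not.
text_value = first.firstChild.nodeValue
element_value = first.nodeValue
print(names, siblings_linked, text_value, element_value)
```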

In addition to accessing the structure relative to a node, there is also a set of operations that we can perform on these structures, including a variety of operations for modifying the document. Some of these methods allow you to add new nodes in various places; note that in the DOM, only Document nodes can create new nodes. See “Methods available to Document objects” for details. The following methods are available on any node.

Methods available to every Node object

appendChild(node)

This method adds node as the last child of the current instance. This is useful for manually building a document in breadth-first document order.

insertBefore(newChild, refChild)

This method adds the node newChild to the current instance immediately before child node refChild.

replaceChild(newChild, oldChild)

This method replaces the child node oldChild with the newChild node.

removeChild(oldChild)

This method removes the oldChild node as a child of the instance node.

cloneNode(deep)

This method returns a new copy of the current instance. If (and only if) deep is true, then we copy deeply: the node's attributes and children are also copied deeply.

isSameNode(otherNode)

This method determines whether the instance node and otherNode are the same node based upon object identity.

normalize()

This method merges any adjacent text nodes in the attributes or descendants of the current instance.

hasChildNodes()

This method returns true if and only if the instance node has any child nodes.

xpath(expr, explicitNss)

This method evaluates the XPath expression expr with the current instance as the expression context and returns an appropriately-valued result. The explicitNss parameter is optional; it is a Python dictionary mapping namespace prefixes to namespaces for use in the expression. See “XPath queries” for details.
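The tree-editing methods above can be sketched as follows. The example uses xml.dom.minidom as a stand-in for Domlette (an assumption; minidom shares these DOM method names but has no xpath() method, so that call is not shown), and the element names are invented.

```python
from xml.dom import minidom

doc = minidom.parseString(b"<list><item>b</item></list>")
root = doc.documentElement
item_b = root.firstChild

# appendChild: add a new last child.
item_c = doc.createElementNS(None, "item")
item_c.appendChild(doc.createTextNode("c"))
root.appendChild(item_c)

# insertBefore: place a new node ahead of an existing child.
item_a = doc.createElementNS(None, "item")
item_a.appendChild(doc.createTextNode("a"))
root.insertBefore(item_a, item_b)

# cloneNode(True) copies the whole subtree; the copy is a distinct node.
copy_of_a = item_a.cloneNode(True)
same = copy_of_a.isSameNode(item_a)

# replaceChild and removeChild both name an existing child.
root.replaceChild(copy_of_a, item_c)
root.removeChild(item_b)

# normalize: merge adjacent text nodes into one.
copy_of_a.appendChild(doc.createTextNode("!"))
before = len(copy_of_a.childNodes)
copy_of_a.normalize()
after = len(copy_of_a.childNodes)

result = root.toxml()
print(result)
```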

In addition to their behavior as nodes, Document nodes are uniquely responsible for a number of tasks. For example, only Document nodes can create other nodes. The following methods are available only to Document nodes.

Methods available to Document objects

createElementNS(namespaceURI, qualifiedName)

This method creates and returns a new Element with the given namespace URI and qualified name.

createAttributeNS(namespaceURI, qualifiedName)

This method creates and returns a new attribute (Attr object) with the given namespace URI and qualified name.

createTextNode(data)

This method creates and returns a new Text node with the string value of data.

createProcessingInstruction(target, data)

This method creates and returns a new processing instruction (ProcessingInstruction object) with the given target name and contents taken from data.

createComment(data)

This method creates and returns a new Comment with the string value of data.

createDocumentFragment()

This method creates and returns a new, empty document fragment (DocumentFragment object).

importNode(importedNode, deep)

Nodes can only belong to one document at a time. This method creates a copy of the node importedNode that belongs to the instance (but which does not yet have a parent). If (and only if) deep is true, then we copy deeply: the node's attributes and children are also copied deeply and imported.
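As a rough illustration of these factory methods, the following sketch again borrows xml.dom.minidom in place of Domlette (an assumption, so that it runs on a stock Python 3). All element names and content are invented.

```python
from xml.dom import minidom

doc = minidom.parseString(b"<menu/>")
root = doc.documentElement

# Only the Document creates nodes; they are attached in a separate step.
root.appendChild(doc.createComment("breakfast"))
root.appendChild(doc.createProcessingInstruction("render", "compact"))
dish = doc.createElementNS(None, "dish")
dish.appendChild(doc.createTextNode("eggs"))
root.appendChild(dish)

# A fragment groups nodes; appending it moves its children into place.
frag = doc.createDocumentFragment()
frag.appendChild(doc.createElementNS(None, "dish"))
frag.appendChild(doc.createElementNS(None, "dish"))
root.appendChild(frag)

# importNode copies a node from another document into this one.
other = minidom.parseString(b"<extra><dish>spam</dish></extra>")
imported = doc.importNode(other.documentElement.firstChild, True)
root.appendChild(imported)

result = root.toxml()
print(result)
```

The source document is left untouched by importNode; the imported copy belongs to the target document from the moment it is created.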

Document nodes also have a number of properties that are not found on other nodes. These properties are summarized in the following list.

Properties available on Document objects

doctype

This is a DocumentType object that encapsulates info about the document's "type", as described in its DOCTYPE tag. In Domlette, which doesn't use such objects, the value of the doctype property will always be None.

documentElement

This is the root element of the document.

documentURI

This is the URI that identifies the document.

implementation

This is the DOMImplementation that created the document.

publicId

This Domlette-specific property is the public ID of the DTD of this document.

rootNode

This refers to the current instance.

systemId

This Domlette-specific property is the system ID of the DTD of this document.

unparsedEntities

This is the list of unparsed entities in the current document.

Attributes (Attr objects) do not have any special methods, but they do have a few additional properties. These properties are summarized in the following list.

Properties available on Attr objects

name

This is the qualified name of the current instance.

nodeName

This is a synonym for the name property.

ownerElement

This is a synonym for the parentNode property.

specified

You will probably never need this property. It is always 1. DOM says it should be 0 if the attribute is present through defaulting, rather than explicitly specified in the document. This is only possible if the DOM implementation preserves certain details from DTD processing, which 4Suite never does. Therefore the value is always 1.

value

This is a synonym for the nodeValue property.

Since attributes can only be attached to elements, Element objects have a set of special methods for managing which attributes are attached to them. We describe these methods below.

Methods available to Element objects

hasAttributeNS(namespaceURI, localName)

This method returns true if the current instance has an attribute with the given namespace URI and local name, and false otherwise.

getAttributeNS(namespaceURI, localName)

This method returns the attribute value of the attribute with the given namespace URI and local name, if one exists. If not, this returns None.

getAttributeNodeNS(namespaceURI, localName)

This method returns the Attr object of the attribute with the given namespace URI and local name, if one exists. If not, this returns None.

removeAttributeNS(namespaceURI, localName)

This method removes the attribute with the given namespace URI and local name from the current instance element.

removeAttributeNode(node)

This method removes the attribute node from the current instance element.

setAttributeNS(namespaceURI, qualifiedName, value)

This method adds an attribute or replaces an attribute with the specified namespace URI and qualified name and sets the content of that attribute to value.

setAttributeNodeNS(node)

This method adds or replaces an attribute using the Attr object node.
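The namespace-aware attribute methods can be exercised with the sketch below, which uses xml.dom.minidom rather than Domlette (an assumption). The return value for a missing attribute can differ between DOM implementations, so the example only looks up attributes that exist. The XLink namespace is used purely as a familiar example.

```python
from xml.dom import minidom

XLINK = "http://www.w3.org/1999/xlink"
doc = minidom.parseString(b"<a id='1'/>")
elem = doc.documentElement

# Attributes without a namespace use None as the namespace URI.
had_id = elem.hasAttributeNS(None, "id")
id_value = elem.getAttributeNS(None, "id")

# setAttributeNS takes a qualified name; lookups use the local name.
elem.setAttributeNS(XLINK, "xlink:href", "eggs.xml")
href = elem.getAttributeNS(XLINK, "href")
href_node = elem.getAttributeNodeNS(XLINK, "href")

# Removing an attribute is also keyed on namespace URI and local name.
elem.removeAttributeNS(None, "id")
has_id_after = elem.hasAttributeNS(None, "id")
print(had_id, id_value, href, href_node.name, has_id_after)
```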

Elements also have several properties above and beyond what they get from being Nodes. See the list below for details.

Properties available on Element objects

nodeName

This is the qualified name of the current instance.

tagName

This is a synonym for nodeName.

Both Text and Comment nodes are also more general CharacterData nodes in the DOM. CharacterData nodes have several additional properties and methods for managing the string data that they contain. The individual Text and Comment nodes, however, do not add any functionality to their general CharacterData parent class. You can find descriptions of the properties and methods offered by CharacterData objects below.

Properties available on CharacterData objects

data

This is the string content of the current instance.

length

This is the length of the string content of the current instance.

nodeValue

This is a synonym for data.

Methods available to CharacterData objects

insertData(offset, data)

This method inserts the string data into the content of the current instance at the index specified by offset.

appendData(data)

This method appends the string data to the end of the value of the current instance.

replaceData(offset, count, data)

This method replaces count number of characters found at index offset in the current instance with the string data.

substringData(offset, count)

This method retrieves and returns the part of the string value of the current instance that begins at index offset and extends count characters.

deleteData(offset, count)

This method deletes the part of the string value of the current instance that begins at index offset and extends count characters.
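
To make the offset and count arguments concrete, here is a short sketch of the CharacterData methods in sequence. It assumes xml.dom.minidom from the standard library as a stand-in for Domlette; the method names are the standard DOM ones listed above, and the sample text is invented.

```python
# Stand-in sketch: xml.dom.minidom is used here, not Domlette.
from xml.dom.minidom import parseString

doc = parseString("<greeting>Hello world</greeting>")
text = doc.documentElement.firstChild  # a Text node

text.insertData(5, ",")           # data is now "Hello, world"
text.appendData("!")              # data is now "Hello, world!"
text.replaceData(7, 5, "there")   # data is now "Hello, there!"
word = text.substringData(0, 5)   # returns "Hello"; data is unchanged
text.deleteData(5, 7)             # data is now "Hello!"

print(text.data)
```

All offsets are zero-based, and count is a number of characters rather than an end index.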

A few DOM actions are not "owned" by any individual document. In effect, they are general-purpose operations. They can be found in DOMImplementation objects. One such precreated instance can be conveniently found at and used from Ft.Xml.Domlette.implementation. The general methods that such a DOMImplementation object offers are listed below.

DOMImplementation methods:

createDocument(namespaceURI, qualifiedName, doctype)

This standard DOM method creates and returns a Document object associated with the given DocumentType object, and having a single element child with the given QName and namespace. Since Domlette does not use DocumentType objects, the doctype argument must be given as None.

createRootNode(documentURI)

This Domlette-specific method creates a Document object with the specified document (base) URI. No document element is created. This method is generally preferred over createDocument(); see the following section, 'Building a DOM from scratch'.

hasFeature(feature, version)

This method tests whether the DOM implementation implements a specific feature.

3.2.1 What about getElementsByTagName()?

The getElementsByTagName() method isn't supported, because there are better options. In particular, you can just use XPath:

doc.xpath(u"//tagname")

For more possibilities, see getElementsByTagName Alternatives.

3.3 Serializing Domlette nodes

Domlette comes with a couple of very fast printer functions which also go to great pains to correctly handle character encoding issues: Print and PrettyPrint. Here are some serialization examples using the Domlette printers, given a node 'node' (it doesn't have to be a document node).

from Ft.Xml.Domlette import Print, PrettyPrint

# basic serialization to sys.stdout
Print(node)

# ... with extra whitespace (indenting)
PrettyPrint(node)

# ... using a single tab, rather than 2 spaces, to indent at each level
PrettyPrint(node, indent='\t')

# serializing to a utf-8 encoded file
f = open('output.xml','w')
Print(node, stream=f)
f.close()

# ... to an iso-8859-1 encoded file
f = open('output.xml','w')
Print(node, stream=f, encoding='iso-8859-1')
f.close()

# ... to an ascii encoded string
import cStringIO
buf = cStringIO.StringIO()
Print(node, stream=buf, encoding='us-ascii')
# read the value before closing the buffer
s = buf.getvalue()
buf.close()

# Normally, output syntax (XML or HTML) is chosen based on the DOM type,
# which is automatically detected. A Domlette or XML DOM can be output in
# HTML syntax if the asHtml=1 argument is given.
PrettyPrint(node, asHtml=1)

See also: Serializing XML from DOM or Domlette documents

3.4 Building a DOM from scratch

As an alternative to parsing a preexisting XML document, you can also build a document model, with certain limitations, from the ground up. W3C and Python DOM facilities for doing this are intended mainly for creating a temporary document whose nodes will be imported into an existing document, and while Domlette does offer a more convenient document creation method, it has many of the same limitations. However, for most documents, its capabilities should be sufficient.

The Ft.Xml.Domlette module contains a DOMImplementation instance named implementation which provides a set of methods for initializing new Documents. The implementation.createRootNode method takes a base URI argument and provides a natural approach for creating an XPath model root node. This is similar to the DOM idea of a document node and even closer to a DOM document fragment (multiple element children are allowed). The implementation.createDocument method, on the other hand, is designed to come close to the DOM interface, although its doctype argument must be None.

doc = implementation.createRootNode('file:///article.xml')

is the equivalent of

from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createDocument(EMPTY_NAMESPACE, None, None)

with the added advantage of doc.baseURI being set to 'file:///article.xml', which is not possible to set via standard DOM interfaces (the baseURI attribute is read-only).

Similarly,

from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createRootNode('file:///article.xml')
docelement = doc.createElementNS(EMPTY_NAMESPACE, 'article')
doc.appendChild(docelement)

is the equivalent of

from Ft.Xml import EMPTY_NAMESPACE
doc = implementation.createDocument(EMPTY_NAMESPACE, 'article', None)

plus doc.baseURI being set to 'file:///article.xml'.

If you want as much fidelity to the DOM API as Domlette offers, use implementation.createDocument. If you just want to create a document or other such root-level node, and never mind the strange parameters, use implementation.createRootNode.
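
For comparison, the createDocument route looks like this when written against a generic DOM implementation. This is a sketch that assumes the standard library's xml.dom.minidom rather than Domlette, so there is no createRootNode and no way to set a base URI; the element names and content are invented.

```python
# Stand-in sketch: xml.dom.minidom is used here, not Domlette.
from xml.dom.minidom import getDOMImplementation

impl = getDOMImplementation()
# No namespace, document element named "article", no doctype
doc = impl.createDocument(None, "article", None)
article = doc.documentElement

# Nodes are always created through the owning document
title = doc.createElementNS(None, "title")
title.appendChild(doc.createTextNode("Building a DOM from scratch"))
article.appendChild(title)
article.setAttributeNS(None, "id", "a1")

xml_text = article.toxml()
print(xml_text)
```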

3.5 XPath query

You can easily perform XPath queries by using the xpath method of cDomlette nodes, as follows:

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
print doc.xpath(u'//a')
print doc.xpath(u'string(/spam)')

Notice: this is nothing like W3C DOM's XPath query module. The emphasis, as usual with Domlette, is on speed, simplicity and pythonic-ness.

The API, in brief:

node.xpath(expr[, explicitNss])

  • node - will be used as core of the context for evaluating the XPath

  • expr - XPath expression in string or compiled form

  • explicitNss - (optional) any additional or overriding namespace mappings in the form of a dictionary that maps prefixes to namespace URIs. The base namespace mappings are taken from in-scope declarations on the given node. This explicit dictionary is superimposed on the base mappings.

For additional details, see “XPath queries”.

3.6 More on base URIs

For some users, always specifying a base URI feels like an inconvenience. Perhaps they always generate XML sources from text or streams without naturally associated URIs, and they have to figure out schemes to come up with base URIs for the parse. But there is good reason for this pickiness. Just ask one of the users who got bitten by carelessness with base URIs in practice. It's better to always put some amount of thought into base URIs when processing XML, and 4Suite encourages this.

Note that 4Suite only enforces the requirement for base URIs in cases where they are needed to make sense of a requested operation. Your document must have a valid base URI if you use external entities, XInclude, xsl:import, xsl:include, the XSLT document() function, the EXSLT exsl:document element, or any other operations that require access to an external resource. If your main use for URI resolution is XSLT import and includes, you can avoid having to give valid base URIs by using XSLT include paths.

A valid base URI starts with a scheme, such as http:. A simple name, such as "spam" is a valid relative URI reference, but not a valid base URI. Without a base URI, a relative reference is no more useful than an apartment number given without the address of the entire apartment building. Merging a base URI with a relative reference is a string operation that is undertaken in a standard manner, and is generally only useful when the base URI is hierarchical; that is, it is a URL using one of the common schemes that have slashes as path separators (e.g., http:, ftp:, gopher:, and most file: URLs). The built-in 4Suite URI resolver Ft.Lib.Uri.BASIC_RESOLVER knows how to perform such resolution.
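
The merging of a base URI with a relative reference can be illustrated with the standard library's urljoin, which implements the same kind of standard resolution. This is only a sketch: it does not use Ft.Lib.Uri, and the URIs are invented.

```python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin      # Python 2

base = "http://example.org/docs/article.xml"

# The last path segment of the base is replaced by the relative reference
resolved = urljoin(base, "spam")
# ".." climbs one level in the hierarchical path
parent = urljoin(base, "../images/fig1.png")
# With no base there is nothing to merge with; the reference stays relative
unresolved = urljoin("", "spam")

print(resolved)
print(parent)
print(unresolved)
```

The last case is the apartment number without the building: the result is still just "spam", which no resolver can fetch.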

3.7 Why does Domlette diverge from the DOM specification?

Domlette is not a complete or fully conformant DOM implementation, but it does provide an interface very close to W3C DOM Level 2 and the corresponding Python mapping as laid out in the xml.dom API docs.

The areas of divergence are inconsequential for most users, and generally reflect decisions made in the interest of eliminating redundancy, inefficiency, and, to some degree, un-Pythonic design. Also, one of the important design principles for Domlette is that where DOM and XPath disagree, XPath wins; aside from making things more efficient to implement, this behavior is generally what people want in an XML document model.

It is also worth noting that in the interest of usability, all DOM implementations exhibit some degree of variation from the specs. Coding a completely implementation-agnostic DOM application is difficult and usually unnecessary.

4 SAX

Saxlette is a fast SAX implementation, written entirely in C. Its API is similar to that of Python's built-in SAX.

from xml import sax
from Ft.Xml import CreateInputSource

class element_counter(sax.ContentHandler):
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

parser = sax.make_parser(['Ft.Xml.Sax'])
handler = element_counter()
parser.setContentHandler(handler)
#'file:ot.xml' or file('ot.xml') or file('ot.xml').read() would work just as well, of course
parser.parse(CreateInputSource('ot.xml'))
print "Elements counted:", handler.ecount

If you don't care about PySax compatibility, you can use the more specialized API, which involves the following lines in place of the equivalents above:

from Ft.Xml import Sax
...
class element_counter:
....
parser = Sax.CreateParser()

The biggest API difference between Saxlette and PySax is that Saxlette only supports SAX 2. For example, feature_namespaces is hard-wired to True and feature_namespace_prefixes to False (which is exactly what SAX2 says is required). Saxlette also combines all adjacent text events, which eliminates one of the pain points of PySax.
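
To see what namespace mode means for a handler, the following sketch uses the standard library parser, where feature_namespaces has to be switched on explicitly (Saxlette behaves as if it were always on). The handler class name and the sample document are invented for illustration.

```python
# Stand-in sketch: the standard library SAX parser is used here, not Saxlette.
import io
from xml import sax
from xml.sax.handler import ContentHandler, feature_namespaces

class name_collector(ContentHandler):
    def startDocument(self):
        self.names = []

    def startElementNS(self, name, qname, attribs):
        # In namespace mode, name is a (namespace URI, local name) tuple
        self.names.append(name)

parser = sax.make_parser()
parser.setFeature(feature_namespaces, True)
handler = name_collector()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(b'<ex:a xmlns:ex="http://example.com/ns#"><b/></ex:a>'))
print(handler.names)
```

Elements in no namespace are reported with None as the first item of the tuple.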

The argument to the parse method is a URI, a SAX input source or a 4Suite input source. In the example above, a 4Suite input source was created with CreateInputSource. The following example shows similar code using the factory in 4Suite's Ft.Xml.InputSource.

from Ft.Xml import InputSource, Sax
factory = InputSource.DefaultFactory
isrc = factory.fromUri("file:ot.xml")

class element_counter:
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

parser = Sax.CreateParser()
handler = element_counter()
parser.setContentHandler(handler)
parser.parse(isrc)
print "Elements counted:", handler.ecount

4.1 Validating a document while parsing it using SAX

To enable validation of your documents while otherwise parsing them normally with SAX, set the xml.sax.handler.feature_validation feature to True on your parser using a line similar to parser.setFeature(xml.sax.handler.feature_validation, True). The parser will then throw an xml.sax._exceptions.SAXParseException exception if it determines that the document is invalid, and it will stop parsing the document. Handlers for document components that have been parsed will be called, however. The following example illustrates these concepts.

from Ft.Xml import InputSource, Sax
factory = InputSource.DefaultFactory

XML = """<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/></a>"""

isrc = factory.fromString(XML, 'urn:x-example:valid-a')

class element_counter:
    def startDocument(self):
        self.scount = 0
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.scount += 1

    def endElementNS(self, name, qname):
        self.ecount += 1

parser = Sax.CreateParser()
handler = element_counter()
parser.setContentHandler(handler)
# And now, to enable validation...
import xml.sax.handler
parser.setFeature(xml.sax.handler.feature_validation, True)
parser.parse(isrc)
print "Saw", handler.scount, "start tags"
print "Saw", handler.ecount, "end tags"

# And now we show what happens on an invalid document:
XML = """<!DOCTYPE a [
  <!ELEMENT a (b, b)>
  <!ELEMENT b EMPTY>
]>
<a><b/><b/><b/></a>"""

isrc = factory.fromString(XML, 'urn:x-example:invalid-a')
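from xml.sax import SAXParseException
try:
    parser.parse(isrc)
except SAXParseException, e:
    # Validation failed; parsing stops at the point of the error
    print "Validation error:", e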
print "Saw", handler.scount, "start tags"
print "Saw", handler.ecount, "end tags"
# The above document is invalid; it has one more `b` element than is
# allowed by the DTD.  The handlers have still been called for those
# parts of the document that have been parsed.
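
The same pattern of catching the exception and then inspecting the handler can be tried without 4Suite. The sketch below assumes the standard library parser, which does not validate, so a well-formedness error (a mismatched end tag) stands in for the validity error. The handler class name and the sample document are invented.

```python
# Stand-in sketch: the standard library SAX parser is used here, not Saxlette,
# and the error is a well-formedness error rather than a validity error.
import io
from xml import sax
from xml.sax import SAXParseException
from xml.sax.handler import ContentHandler

class start_counter(ContentHandler):
    def startDocument(self):
        self.scount = 0

    def startElement(self, name, attrs):
        self.scount += 1

BROKEN = b"<a><b/><b/></c>"  # the end tag does not match the start tag

parser = sax.make_parser()
handler = start_counter()
parser.setContentHandler(handler)
error = None
try:
    parser.parse(io.BytesIO(BROKEN))
except SAXParseException as e:
    error = e

# The handler was still called for everything before the error
print("Saw %d start tags before the error" % handler.scount)
```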

4.2 Walking a DOM to fire SAX events

Saxlette has the ability to walk a Domlette tree, firing off events to a handler as if from a source document parse. This ability used to be too well hidden, so an API addition makes it more readily available: Ft.Xml.Domlette.SaxWalker. The following example shows how easy it is to use:

from Ft.Xml.Domlette import SaxWalker
from Ft.Xml import Parse

XML = "<a><b/><b/></a>"

class element_counter:
    def startDocument(self):
        self.ecount = 0

    def startElementNS(self, name, qname, attribs):
        self.ecount += 1

#First get a Domlette document node
doc = Parse(XML)
#Then SAX "parse" it
parser = SaxWalker(doc)
handler = element_counter()
parser.setContentHandler(handler)
#You can set any properties or features, or do whatever
#you would to a regular SAX2 parser instance here
parser.parse() #called without any argument
print "Elements counted:", handler.ecount

4.3 Building a Domlette from SAX events

Saxlette includes a convenience ContentHandler (Ft.Xml.Sax.DomBuilder) which listens for SAX events and constructs Domlette Documents.
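
The general pattern is to install the builder as the content handler, run the parse, and then ask the builder for the finished document. The sketch below shows that pattern with the standard library's xml.dom.pulldom.SAX2DOM, which builds a minidom tree. It is a stand-in and does not show the DomBuilder API itself; the sample document is invented.

```python
# Stand-in sketch: xml.dom.pulldom.SAX2DOM is used here, not Ft.Xml.Sax.DomBuilder.
import io
from xml import sax
from xml.dom.pulldom import SAX2DOM

parser = sax.make_parser()
builder = SAX2DOM()  # a ContentHandler that assembles a minidom document
parser.setContentHandler(builder)
parser.parse(io.BytesIO(b"<a><b/><b/></a>"))

# After the parse, the builder holds the completed document
doc = builder.document
print(doc.documentElement.toxml())
```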

4.4 Feeding a generator from SAX events

Python's generators are special functions that can produce a series of partial results in the course of running. The calling program can start up a generator, which is suspended when a partial result is yielded, and resumed explicitly by the program when the next result is required. This capability is mirrored in the Expat parser that is the basis of Saxlette. Saxlette has a feature, FEATURE_GENERATOR, which you can set on a parser object to enable generator semantics. If this feature is set, the parse() method returns an iterator. This iterator yields results set by the SAX handlers. The handlers specify the partial results by setting the property PROPERTY_YIELD_RESULT to the value to be yielded. As an example, the following code reports the names of all attributes used in the document.

class report_attributes:
    def __init__(self, parser):
        self.parser = parser
        return

    def startElementNS(self, name, qname, attribs):
        self.parser.setProperty(Sax.PROPERTY_YIELD_RESULT, attribs)
        return

from Ft.Xml import Sax, CreateInputSource

parser = Sax.CreateParser()
parser.setFeature(Sax.FEATURE_GENERATOR, True)
handler = report_attributes(parser)
parser.setContentHandler(handler)
attribs_iterator = parser.parse(CreateInputSource('test.xhtml'))
for attribs in attribs_iterator:
     for name in attribs.keys(): print name
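
If you want to experiment with the same idea without 4Suite, a plain Python generator over an incremental parse behaves similarly from the caller's point of view. This sketch assumes the standard library's xml.etree.ElementTree.iterparse instead of Saxlette's FEATURE_GENERATOR; the function name attribute_names and the sample document are invented.

```python
# Stand-in sketch: ElementTree.iterparse is used here, not Saxlette.
import io
import xml.etree.ElementTree as ET

def attribute_names(stream):
    """Yield attribute names one at a time as start tags are parsed."""
    for event, elem in ET.iterparse(stream, events=("start",)):
        for name in elem.attrib:
            # Execution is suspended here until the caller asks for more
            yield name

XML = b'<html lang="en"><body class="main"><p id="p1"/></body></html>'
names = list(attribute_names(io.BytesIO(XML)))
print(names)
```

As with the Saxlette iterator, the caller can stop consuming results early, for example with break inside a for loop.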

4.5 SAX filters

In SAX processing, the parser passes to the application a stream of events that represents the XML content. An important aspect of SAX is the user's ability to create SAX filters, which accept a stream of SAX events and pass on a modified stream. For example, you might use a SAX filter to look for DocBook sect1, sect2, etc. elements, and rename them to section elements before passing them on for further processing (presumably by a SAX handler that only understands how to deal with the latter form). You can chain SAX filters as well. The idea behind SAX filters is usually reuse across a broad array of applications, with each filter focused on a single task that can be cleanly separated from upstream and downstream processing. SAX filters can thus be useful building blocks for XML pipelines.

from xml import sax
from xml.sax.saxutils import XMLFilterBase
from Ft.Xml import CreateInputSource, XML_NAMESPACE as XMLNS
from Ft.Xml.Sax import SaxPrinter

XML = """<?xml version="1.0" encoding="utf-8"?>
<menu>
  <item id="A" xml:lang="en">Orange juice</item>
  <item id="A" xml:lang="es">Jugo de naranja</item>
  <item id="B" xml:lang="en">Toast</item>
  <item id="B" xml:lang="es">Pan tostada
    <note xml:lang="en">Wheat bread only, please</note>
  </item>
</menu>
"""

#Define constants for the two states we care about
ALLOW_CONTENT = 1
SUPPRESS_CONTENT = 2

class english_only_filter(XMLFilterBase):
    def __init__(self, downstream):
        XMLFilterBase.__init__(self, downstream)
        return

    def startDocument(self):
        #Set the initial state, and set up the stack of states
        self._state_stack = [ALLOW_CONTENT]
        XMLFilterBase.startDocument(self)
        return

    def startElementNS(self, name, qname, attrs):
        #Check if there is any language attribute
        lang = attrs.get((XMLNS, 'lang'))
        if lang:
            #Set the state as appropriate
            if lang[:2] == 'en':
                state = ALLOW_CONTENT
            else:
                state = SUPPRESS_CONTENT
        else:
            #No language attribute: inherit the enclosing state
            state = self._state_stack[-1]
        #Always update the stack with the current state
        #Even if it has not changed
        self._state_stack.append(state)
        #Only forward the event if the state warrants it
        if state == ALLOW_CONTENT:
            XMLFilterBase.startElementNS(self, name, qname, attrs)
        return

    def endElementNS(self, name, qname):
        #Pop the state that was pushed for this element
        state = self._state_stack.pop()
        #Only forward the event if the state warrants it
        if state == ALLOW_CONTENT:
            XMLFilterBase.endElementNS(self, name, qname)
        return

    def characters(self, content):
        #Only forward the event if the state warrants it
        if self._state_stack[-1] == ALLOW_CONTENT:
            XMLFilterBase.characters(self, content)
        return

if __name__ == "__main__":
    parser = sax.make_parser(['Ft.Xml.Sax'])
    #SaxPrinter is a special SAX handler that merely writes
    #SAX events back into an XML document
    filtered_parser = english_only_filter(parser)
    handler = SaxPrinter()
    filtered_parser.setContentHandler(handler)
    filtered_parser.parse(CreateInputSource(XML))

Most SAX handlers operate as state machines, meaning they manage some variables based on the stream of events that come in, and change behavior based on these variables. english_only_filter is set up to be in one of two states: one in which content is passed on to the downstream handler, and one in which content is suppressed. This state is marked in the self._state_stack. The state is initially set to ALLOW_CONTENT, and changed to SUPPRESS_CONTENT if the filter encounters an xml:lang attribute that represents a language other than English (which can be done by checking the first two characters of the value, according to the rules of standard language codes). It has to be a stack because XML language specifications are scoped, so that in the example XML at the top of the listing the string "Pan tostada" is within the scope of the element with the attribute xml:lang="es", and so it is marked as being in Spanish. The entire note element, however, is marked as being in English by an overriding xml:lang="en" attribute.

The SAX handler is set to Ft.Xml.Sax.SaxPrinter, which channels the final SAX events to a 4Suite printer that creates a serialized XML document. It's quite easy to chain filters. If you wanted the parser to send events to a filter of class some_other_filter, which then passed on events to english_only_filter, the relevant line would look as follows:

    filtered_parser = english_only_filter(some_other_filter(parser))
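
The DocBook renaming filter mentioned at the start of this section could be sketched as follows. This version assumes only the standard library (the default SAX parser, with XMLGenerator in place of SaxPrinter), so it works in non-namespace mode with startElement and endElement. The class name section_renamer and the sample document are invented.

```python
# Stand-in sketch: standard library parser and XMLGenerator, not Saxlette/SaxPrinter.
import io
import re
from xml import sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator

SECT = re.compile(r"^sect[1-5]$")  # matches sect1 through sect5

class section_renamer(XMLFilterBase):
    def startElement(self, name, attrs):
        if SECT.match(name):
            name = "section"
        XMLFilterBase.startElement(self, name, attrs)

    def endElement(self, name):
        # The end tag must be renamed in the same way as the start tag
        if SECT.match(name):
            name = "section"
        XMLFilterBase.endElement(self, name)

XML = (b"<article><sect1><title>One</title>"
       b"<sect2><title>Two</title></sect2></sect1></article>")

out = io.StringIO()
filtered_parser = section_renamer(sax.make_parser())
filtered_parser.setContentHandler(XMLGenerator(out, "utf-8"))
filtered_parser.parse(io.BytesIO(XML))
result = out.getvalue()
print(result)
```

Because this filter keeps no state, it can be placed anywhere in a chain of filters.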

4.6 Streaming canonicalization

The combination of streaming parsing using Saxlette and streaming serialization using Ft.Xml.Lib.CanonicalXmlPrinter allows for very efficient XML canonicalization (c14n).

import sys
from xml import sax
from Ft.Xml import CreateInputSource
from Ft.Xml.Sax import SaxPrinter
from Ft.Xml.Lib.XmlPrinter import CanonicalXmlPrinter

parser = sax.make_parser(['Ft.Xml.Sax'])
handler = SaxPrinter(CanonicalXmlPrinter(sys.stdout))
parser.setContentHandler(handler)
parser.parse(CreateInputSource('   <a><b b="1" a="2"/></a>   '))
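
To get a feel for what canonicalization does to such input, you can compare with the canonicalize function in the standard library's ElementTree module. This is a sketch under several assumptions: it requires Python 3.8 or later, it is not streaming from SAX events, and it implements C14N 2.0, so its output is not guaranteed to match that of CanonicalXmlPrinter in every case.

```python
# Stand-in sketch: ElementTree's C14N 2.0 implementation, not CanonicalXmlPrinter.
from xml.etree.ElementTree import canonicalize  # Python 3.8+

# Attributes are sorted and the empty element gets an explicit end tag
canonical = canonicalize('<a><b b="1" a="2"/></a>')
print(canonical)
```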

5 XPath queries

4Suite provides an XPath processing engine, compliant with the W3C XPath 1.0 specification. This query engine is accessible through Ft.Xml.XPath.

5.1 The quickest option

If you are using Domlette, as described above, the quickest and easiest way to use the XPath facility in 4Suite is the xpath() method, which any Domlette Node supports:

from Ft.Xml.Domlette import NonvalidatingReader
doc = NonvalidatingReader.parseString("<spam>eggs<a/><a/></spam>")
doc2 = NonvalidatingReader.parseString("<spam>eggs<eggs n='1'> and ham</eggs></spam>")
print doc.xpath(u'(//a)[1]')
print doc.xpath(u'string(/spam)')
print doc2.xpath(u'string(//eggs/@n)')

The line

print doc.xpath(u'(//a)[1]')

is actually a shortcut for the following more involved construct, which is described in detail in the next section:

from Ft.Xml.XPath import Evaluate
print Evaluate(u'(//a)[1]', contextNode=doc)

This example prints three lines. The first line shows a string representation of a list containing a single element. As we see from this line, an XPath selection of nodes returns a Python list. In this case, it is a list containing a single element—the first element with a local name of a, which has no attributes and no children. The second line shows the correct string value of the selected spam element, and the third line shows the correct string value of the n attribute.

[<Element at 0xb7d10bb4: name u'a', 0 attributes, 0 children>]
eggs
1

5.2 Type mappings

4Suite XPath functions return results with Python types that depend on the XPath data model type of the query result. The following list shows how the five XPath result types (string, number, boolean, node-set and object) are mapped to Python types:

  • XPath string: Python unicode type

  • XPath number: Python float type (int or long also accepted), or instance of Ft.Lib.number.nan (for NaN) or Ft.Lib.number.inf (for Infinity)

  • XPath boolean: Ft.Lib.boolean instance

  • XPath node-set: Python list of Domlette nodes, in document order, with no duplicates

  • XPath foreign object: any other Python object (you will very rarely encounter this case)

5.3 Advanced use

XPath expressions can refer to both variables and qualified names (QNames) that must be defined by the environment that is executing the XPath expression. This section describes how to use these advanced features of XPath using the 4Suite interface.

4Suite's XPath implementation uses a Domlette node as the context node for XPath operations, so the document must be parsed before XPath can be used to access it. The following example parses an XML document and explicitly sets up an XPath context to run an XPath query.

XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""

from Ft.Xml import Parse
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Evaluate

doc = Parse(XML)
ctx = Context(doc)
nodes = Evaluate(u'//em', ctx)

# The return value, a node set, comes back as a Python list of nodes
# which may be accessed using an iterator
for n in nodes:
    # print dir(n)
    print n.tagName
    print n.firstChild.nodeValue

XPath always requires a context for execution; a common XPath context is the root of the target document, as in the above example. Think of an XPath query as being executed from some location in an XML document. This location in the document is a necessary component of using XPath.

There is more to an XPath context than just the context node, but if your needs are as straightforward as that of the above example, there is an abbreviated version of the Evaluate method for this purpose. For example, the following fragment is equivalent to the two lines creating a context and evaluating the expression in the above example.

# No need to create a context object
Evaluate(u'//em', contextNode=doc)

If your source document uses XML Namespaces you will likely need to use QNames in your XPath expressions. For this to work, you'll need to introduce namespace mappings into your XPath context. For example, if the elements of our XML document above are in an XML namespace, then we must set up our context slightly differently.

XML = """<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold'>emphasized Namespaced Text</em> text
</ham>"""

from Ft.Xml import Parse
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Evaluate

NSS = {u'ex': u'http://example.com/ns#'}
doc = Parse(XML)
ctx = Context(doc, processorNss=NSS)
nodes = Evaluate(u'//ex:em', ctx)
for n in nodes:
    # print dir(n)
    print n.tagName
    print n.firstChild.nodeValue

You define XPath namespace prefixes through a Python dictionary (NSS in the above example) which maps these prefixes, such as 'ex' in the above example, to the appropriate namespace URI, such as 'http://example.com/ns#' in the above example. This prefix mapping is added to your XPath context using the processorNss parameter to the Context function.
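
The prefix-to-URI dictionary is a common convention, and you can experiment with it without 4Suite. The sketch below assumes the standard library's xml.etree.ElementTree, which supports only a limited subset of XPath, as a stand-in for Context and Evaluate. The point it illustrates is the same: the prefix in the expression is resolved through the dictionary, not through the prefixes (or lack of them) used in the document.

```python
# Stand-in sketch: ElementTree's limited XPath support, not Ft.Xml.XPath.
import xml.etree.ElementTree as ET

XML = """<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with <em type='bold'>emphasized Namespaced Text</em> text
</ham>"""

NSS = {"ex": "http://example.com/ns#"}
root = ET.fromstring(XML)

# The document uses a default namespace; the query uses the prefix 'ex'
matches = root.findall(".//ex:em", NSS)
# Without a prefix, the query looks for em in no namespace and finds nothing
unprefixed = root.findall(".//em")

for em in matches:
    print(em.text)
```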

In a similar way, you can also pass in variable bindings, which may be used as values later in your XPath expressions. In this case, however, the dictionary keys are Python tuples containing the namespace URI and local name of the variable.

ctx = Context(node, varBindings=
  {(EMPTY_NAMESPACE, u'date'): u'2003-06-20'})
Evaluate('event[@date = $date]', context=ctx)

This creates a variable named 'date' in no namespace, with a value of '2003-06-20'; this is then used for comparison with the date attribute in the XPath expression.

XPath variable names are QNames, so you pass them in as (namespace URI, local name) tuples. The values can be numbers, unicode objects or boolean objects:

from Ft.Xml.XPath import boolean
ctx = Context(node, varBindings={(EMPTY_NAMESPACE, u'test'): boolean.true})

This sets the variable 'test' to the boolean value true (remember that this is for the XPath environment, not the Python one), and again this may be used as in any XSLT stylesheet.

If you only want a value once, you may of course still use string constants, as in

nodes=Evaluate(u'//testPrefix:em[@type="bold"]',ctx)

Note the quotes: the Python string holding the XPath expression is delimited by single quotes, so the string literal inside the expression uses double quotes.

5.4 Reusing parsed XPath queries

Sometimes you want to reuse an XPath expression and namespace mapping multiple times, for efficiency and convenience. The following example shows how:

from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader

DOCS = ["<spam xmlns='http://spam.com'>eggs</spam>",
        "<spam xmlns='http://spam.com'>grail</spam>",
        "<spam xmlns='http://spam.com'>nicht</spam>",
       ]

# Pre-compile for efficiency and convenience
expr = Compile(u"/a:spam[contains(., 'i')]")
ctx = Context(None, processorNss={u"a": u"http://spam.com"})

i = 1
for doc in DOCS:
    doc = NonvalidatingReader.parseString(doc.encode('UTF-8'),
                                          "http://spam.com/base")
    retval = Evaluate(expr, doc, ctx)
    if len(retval):
        print "Document", i, "meets our criteria"
    i += 1

Which should display:

Document 2 meets our criteria
Document 3 meets our criteria

5.5 Migration from PyXML's XPath

There is a usable XPath module in PyXML (warning: PyXML's XSLT implementation is not usable: use 4Suite if you need XSLT), but there are a lot of updates and improvements in the XPath library version in 4Suite.

If you are familiar with PyXML, you may have used a different form of imports to load in XPath and XSLT features. The imports are different under 4Suite.

Usage example:

  1. PyXML usage (do not use with 4Suite):

    import xml.xslt
    import xml.xpath
  2. 4Suite usage (use these imports):

    import Ft.Xml.XPath
    import Ft.Xml.Xslt

6 XSLT processing

6.1 The super-simple XSLT API

For basic XSLT transform needs, or to get started quickly, the Ft.Xml.Xslt module offers a quick way to apply transforms to XML documents and get back a simple string result. Within this module, the function of interest is Transform.

Transform(fname_or_uri, string_stream_fname_uri_isrc, [params], [output])

The Transform function takes two required arguments and two optional ones. The first is the source XML for the transform. The second is the XSLT document. Both are given as a string, an object like an open file, a local file path on your computer, an absolute URI, or an InputSource object. The optional params argument is a dictionary of stylesheet parameters, the keys of which may be given as unicode objects if they have no namespace, or as (uri, localname) tuples if they do. The values are the overridden parameter values. If you do not supply the optional output argument, the return value is a string with the result of the transform. If you do supply it, it must be a file-like object to which the output will be written, and the return value is then None.

XML = """
<ham>
<eggs n='1'/>
This is the string content with <em>emphasized text</em> text
</ham>"""

from Ft.Xml.Xslt import Transform
# URL for the identity transform: reproduces the input XML in the result
ID_TRANSFORM = 'http://cvs.4suite.org/viewcvs/*checkout*/4Suite/Ft/Data/identity.xslt'

result = Transform(XML, ID_TRANSFORM)
print result

# If the above XML document were located in the file
# "target.xml", we could have used `Transform("target.xml", ID_TRANSFORM)`.

#It's more efficient to redirect the processor output to an output stream.  The following does so:
import sys
result = Transform(XML, ID_TRANSFORM, output=sys.stdout)
# The output has already been written to sys.stdout; result is None here

6.2 Full XSLT processing API

Here is the general procedure for using the Python API for XSLT processing:

  1. Create an Ft.Xml.Xslt.Processor.Processor instance.

  2. Prepare Ft.Xml.InputSource instances (via their factory) for the source XML and stylesheet.

  3. Call the Processor's appendStylesheet method, passing it the stylesheet's InputSource.

  4. Call the Processor's run method, passing it the source document's InputSource.

For input to our transform, we will use a namespaced example like the one in the XPath section.

$ cat testNS.xml
<ham xmlns="http://example.com/ns#">
<eggs n='1'/>
This is the string content with
 <em type='bold' f='2'>emphasized Namespaced Text</em>
text
</ham>

For our stylesheet, we will again use one of the simplest useful examples, the identity stylesheet.

$ cat identity.xsl
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

The code below follows the processing outline, having converted the input file and stylesheet to the URI format.

from Ft.Xml.Xslt import Processor
# We use the InputSource architecture
from Ft.Xml import InputSource
from Ft.Lib.Uri import OsPathToUri  # path to URI conversions

processor = Processor.Processor()

# Prepare an InputSource for the source document
# Convert from local file to uri
srcAsUri = OsPathToUri('testNS.xml')
source = InputSource.DefaultFactory.fromUri(srcAsUri)

# Prepare an InputSource for the stylesheet
# Convert from local file to uri
ssAsUri = OsPathToUri('identity.xsl')
transform = InputSource.DefaultFactory.fromUri(ssAsUri)

processor.appendStylesheet(transform)
result = processor.run(source)

# result is a string with the serialized transform result
print result

You can call run multiple times on different InputSources. When you're done, the processor's reset method can be used to restore a clean slate (at which point you would have to append stylesheets to the processor again).

The following example uses our processor from the previous example to transform a new XML document, this one constructed manually.

XML = """<foo><bar/></foo>"""
source = InputSource.DefaultFactory.fromString(XML, 'http://example.org/foo')

result = processor.run(source)

# result is a string with the serialized transform result
print result

This code continues from the previous example, processing a second document with the same processor and stylesheet. This form is useful when, for example, a server must process many input documents with a common stylesheet.

6.3 Example

In the example below, strings supply both the transform (stylesheet) and the source document, and we are careful to pass in a URI to identify each of them. For the source document, the URI is needed for resolving external entity references and XIncludes. For the stylesheet, the URI is needed for resolving document() function calls, xsl:include and xsl:import.

If you do not provide a URI and you attempt to use any of these features, you may get an exception.

# The identity transform: duplicates the input to output
TRANSFORM = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>"""

SOURCE = """<spam id="eggs">I don't like spam</spam>"""

# The processor class is the core of the XSLT API
from Ft.Xml.Xslt import Processor
processor = Processor.Processor()

# We use the InputSource architecture
from Ft.Xml import InputSource

# Prepare an InputSource for the transform
transform = InputSource.DefaultFactory.fromString(TRANSFORM,
  "http://spam.com/identity.xslt")

# Prepare an InputSource for the source document
source = InputSource.DefaultFactory.fromString(SOURCE,
  "http://spam.com/doc.xml")
processor.appendStylesheet(transform)
result = processor.run(source)

# result is a string with the serialized transform result
print result
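
To see why the base URI matters, consider how a relative reference inside a stylesheet or source document becomes an absolute URI. The sketch below uses the standard library's urljoin purely as a stand-in for illustration; 4Suite performs its own resolution, and the relative references shown (common.xslt, ../shared/util.xslt) are hypothetical.

```python
try:
    from urllib.parse import urljoin   # Python 3
except ImportError:
    from urlparse import urljoin       # Python 2


def resolve(base_uri, reference):
    """Resolve a (possibly relative) URI reference against a base URI."""
    return urljoin(base_uri, reference)


STYLESHEET_BASE = "http://spam.com/identity.xslt"

# e.g. the href of a hypothetical <xsl:include href="common.xslt"/>
included = resolve(STYLESHEET_BASE, "common.xslt")

# A hypothetical reference that climbs out of a subdirectory
nested = resolve("http://spam.com/xsl/main.xslt", "../shared/util.xslt")
```

Without a base URI there is nothing to resolve such references against, which is why using these features without one may lead to an exception.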

6.4 Using Domlette objects instead of InputSources

If your documents are already in the form of Domlette documents, you don't need to create InputSources for them; you can just use the Processor's appendStylesheetNode and runNode methods instead of appendStylesheet and run, respectively.

Note

It is usually slower to read the stylesheet from a Domlette object than to parse a serialized document.

Note

The Domlette documents used in the following example are obtained by parsing existing XML, but this approach can just as easily be used on Domlette documents that are built programmatically (i.e. using the DOM API).

# The identity transform: duplicates the input to output
TRANSFORM = """
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>"""

SOURCE = """<spam id="eggs">I don't like spam</spam>"""

from Ft.Xml.Xslt import Processor
processor = Processor.Processor()
from Ft.Xml.Domlette import NonvalidatingReader

# Create a DOM for the transform
transform = NonvalidatingReader.parseString(TRANSFORM,
  "http://spam.com/identity.xslt")

# Create a DOM for the source document
source = NonvalidatingReader.parseString(SOURCE, "http://spam.com/doc.xml")
processor.appendStylesheetNode(transform, "http://spam.com/identity.xslt")
result = processor.runNode(source, "http://spam.com/doc.xml")
print result

If you have objects from another DOM library, you can first convert them to Domlette objects as shown in “Converting from other DOM libraries”.

6.5 Top-level parameters
Passing parameters to a stylesheet

You can pass in stylesheet parameters as a Python dictionary. Use the parameter names for keys. Values use the 4Suite XPath library's standard type mappings, which are described in “Type mappings”.

Parameter and variable names in XPath/XSLT are actually expanded-names, which we represent as (namespaceURI, localName) tuples. If your parameter name is in a namespace, you have to use a tuple as the mapping key. Otherwise, you may simply use a unicode string holding only the local-name part; the name is then taken to be in no namespace (Ft.Xml.EMPTY_NAMESPACE).

Here is an example, which passes in the computed "date" parameter to the stylesheet from the program:

SRC = """<?xml version="1.0"?><dummy/>"""

STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:param name="date" select="'unknown'"/>

  <xsl:output method="xml" indent="yes" encoding="us-ascii"/>

    <xsl:template match="/">
      <result>
        <xsl:value-of select="$date"/>
      </result>
    </xsl:template>

</xsl:stylesheet>"""

from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
import time
src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')

proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
params = {u'date': unicode(time.asctime())}
result = proc.run(src_isrc, topLevelParams=params)
print result
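
The two key forms for parameter names can be related with a small sketch. The helper expanded_name below is hypothetical and only illustrates the equivalence described above; it is not the processor's own code. The constant EMPTY_NAMESPACE is defined locally as None, on the assumption that it matches Ft.Xml.EMPTY_NAMESPACE, and the namespaced "author" parameter is invented for the example.

```python
EMPTY_NAMESPACE = None  # assumption: stands in for Ft.Xml.EMPTY_NAMESPACE


def expanded_name(key):
    """Return the (namespaceURI, localName) tuple for a parameter key."""
    if isinstance(key, tuple):
        namespace_uri, local_name = key  # fails loudly if not a 2-tuple
        return (namespace_uri, local_name)
    # A plain string names a parameter that is in no namespace
    return (EMPTY_NAMESPACE, key)


params = {
    u'date': u'Tue Aug 15 12:00:00 2006',
    (u'http://example.com/ns#', u'author'): u'anonymous',
}

# Every key, whichever form it was written in, maps to an expanded name
names = sorted(expanded_name(k) for k in params if isinstance(k, tuple)) + \
        [expanded_name(k) for k in params if not isinstance(k, tuple)]
```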

6.6 Using xml-stylesheet processing instructions

4Suite honors the Associating Stylesheets with XML Documents W3C Recommendation and RFC 3023: XML Media Types. Instead of (or in addition to) using the processor's explicit APIs to establish the stylesheet to be used for the transformation, the source document may contain an xml-stylesheet processing instruction (PI) that refers to a stylesheet via a URI reference.

The xml-stylesheet PI must meet the following criteria:

  • It must appear in the document prolog.

  • It must contain a "type" pseudo-attribute having one of the following values:

    • application/xslt+xml

    • application/xslt

    • text/xml

    • application/xml

  • It must contain an "href" pseudo-attribute that is a URI reference for the stylesheet. It will be resolved relative to the base URI of the source document that contains the xml-stylesheet PI.

This example shows a PI that refers to the identity stylesheet mentioned earlier:

<?xml-stylesheet type="application/xslt" href="identity.xsl"?>

If you need to add to the supported media types, e.g., to add the nonstandard "text/xsl", then follow the example given in this mailing list message.

If the PI contains "alternate" and "media" pseudo-attributes, the package will do its best to handle them. See this message for details and examples.
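
The criteria listed above can be sketched in plain Python. This is only an illustration of the rules, not 4Suite's implementation: the function names are hypothetical, the regular expression is a simplification that ignores character references and predefined entities in pseudo-attribute values, and the check that the PI appears in the prolog is left out.

```python
import re

# Media types listed above as acceptable for the "type" pseudo-attribute
XSLT_MEDIA_TYPES = (
    "application/xslt+xml",
    "application/xslt",
    "text/xml",
    "application/xml",
)

# name="value" or name='value'
_PSEUDO_ATTR = re.compile(
    r"""([A-Za-z_][\w.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def parse_pseudo_attributes(pi_data):
    """Return the pseudo-attributes of a PI's data as a dictionary."""
    attrs = {}
    for match in _PSEUDO_ATTR.finditer(pi_data):
        name, double_quoted, single_quoted = match.groups()
        if double_quoted is not None:
            attrs[name] = double_quoted
        else:
            attrs[name] = single_quoted
    return attrs


def stylesheet_href(pi_data):
    """Return the href if the PI qualifies as an XSLT reference, else None."""
    attrs = parse_pseudo_attributes(pi_data)
    if attrs.get("type") in XSLT_MEDIA_TYPES and "href" in attrs:
        return attrs["href"]
    return None


href = stylesheet_href('type="application/xslt" href="identity.xsl"')
```

The returned href is still a URI reference; it would then be resolved against the base URI of the source document.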

6.7 Alternative output destinations

Normally, the processor buffers all output, then returns it as a byte string. If you want to write directly to some other stream (any Python file-like object that has a write method), you can supply the stream as the optional outputStream argument to the Processor's run method. When you supply your own output stream, the run method will return None. Here is an example that writes directly to stdout:

SRC = """<?xml version="1.0"?><dummy/>"""

STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="us-ascii"/>

  <xsl:template match="/">
    <result>hello world</result>
  </xsl:template>

</xsl:stylesheet>"""

import sys
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor

src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')

proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
result = proc.run(src_isrc, outputStream=sys.stdout)

Example 1 — Transform output sent to standard out
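
Since any object with a write method can serve as the output stream, a custom class is enough. The following sketch defines a hypothetical ChunkCollector and exercises it with a stand-in emit function rather than a real processor; whether the processor calls any method other than write is not confirmed here, so treat the class as a starting point.

```python
class ChunkCollector(object):
    """Minimal file-like object: it implements only write()."""

    def __init__(self):
        self.chunks = []

    def write(self, data):
        # Called once for each piece of serialized output
        self.chunks.append(data)

    def getvalue(self):
        return "".join(self.chunks)


def emit(stream):
    # Stand-in for a processor that writes its output piece by piece
    stream.write("<result>")
    stream.write("hello world")
    stream.write("</result>")


collector = ChunkCollector()
emit(collector)
```

An instance of such a class would be passed as the outputStream argument in place of sys.stdout.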

You also have the option of other kinds of output. Just set the writer argument of the processor's run method to an instance of an XSLT output writer, which is a handler of SAX-like events coming from the processor as it generates the result tree. 4Suite provides several writer classes for alternative output:

  • If you want the XSLT output as SAX events, use an instance of Ft.Xml.Xslt.SaxWriter.SaxWriter. Give its constructor a saxHandler keyword argument that is your own PyXML SAX2 event handler.

  • If you want the XSLT output as a Domlette document, use an instance of Ft.Xml.Xslt.RtfWriter.RtfWriter. Give its constructor a second argument: the base URI of the document to create. Obtain the document by calling the writer's getResult method after XSLT processing is finished.

  • If you want the XSLT output as any other kind of Python DOM document, use an instance of Ft.Xml.Xslt.DomWriter.DomWriter. Give its constructor an implementation keyword argument that is your desired DOM implementation. Also try to set the ownerDoc to an existing Document node (from the same implementation) from which a base URI for the new document can be obtained.

  • If you want the XSLT output in a regular file, open a file for writing, then pass the file object to the processor's run method as the outputStream argument, in the same way as the example above, which used the sys.stdout file object. An example is shown below.

  • If you want to make a custom output writer, just make your class extend Ft.Xml.Xslt.NullWriter.NullWriter. If it needs access to the XSLT output parameters, then the constructor should take an instance of Ft.Xml.Xslt.OutputParameters.OutputParameters, which will have the data attributes method, version, encoding, omitXmlDeclaration, standalone, doctypeSystem, doctypePublic, mediaType, cdataSectionElements, and indent, which your writer can act upon, if appropriate. See the NullWriter API documentation for further info.
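
For the SAX option in the list above, the handler is an ordinary SAX2 content handler. The sketch below builds one on the standard library's xml.sax, used here as a stand-in for PyXML, and drives it with the stdlib parser instead of the XSLT processor. The class name is hypothetical, and because it is an assumption whether the writer fires namespace-aware or plain element events, the handler implements both.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class ElementNameCollector(ContentHandler):
    """Records element names and character data from SAX events."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.names = []
        self.text = []

    def startElement(self, name, attrs):
        self.names.append(name)

    def startElementNS(self, name, qname, attrs):
        # name is a (namespaceURI, localName) tuple
        self.names.append(name[1])

    def characters(self, content):
        self.text.append(content)


handler = ElementNameCollector()
# Drive the handler with the stdlib parser to show the events it receives
xml.sax.parseString(b"<result><item>hello world</item></result>", handler)
```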

Here is an example of writing to a regular Domlette document:

SRC = """<?xml version="1.0"?><dummy/>"""

STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="us-ascii"/>

  <xsl:template match="/">
    <result>hello world</result>
  </xsl:template>

</xsl:stylesheet>"""

import sys
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor
from Ft.Xml.Xslt.DomWriter import DomWriter
from Ft.Xml.Domlette import PrettyPrint

src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')

from Ft.Xml.Domlette import implementation as impl
domlette_writer = DomWriter(implementation=impl)

proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)
proc.run(src_isrc, writer=domlette_writer)

result_doc = domlette_writer.getResult()
PrettyPrint(result_doc)

This example writes the transform output to a file. This is a variant of the earlier one. Output is written to tmp.xml.

SRC = """<?xml version="1.0"?><dummy/>"""

STY = """<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="us-ascii"/>

  <xsl:template match="/">
    <result>hello world</result>
  </xsl:template>

</xsl:stylesheet>"""

import sys
from Ft.Xml import InputSource
from Ft.Xml.Xslt import Processor

src_isrc = InputSource.DefaultFactory.fromString(SRC, 'http://foo/dummy.xml')
sty_isrc = InputSource.DefaultFactory.fromString(STY, 'http://foo/dummy.xsl')

proc = Processor.Processor()
proc.appendStylesheet(sty_isrc)

f = open('tmp.xml', mode='w')
result = proc.run(src_isrc, outputStream=f)
f.close()

There are many more options available for customizing XSLT processing; see the Processor module documentation for details:

>>> from Ft.Xml.Xslt import Processor
>>> help(Processor)

6.8 Transform chaining

4Suite provides some hooks for scenarios where the output from one transform becomes the source document for another. This is called transform chaining. The user still has to write the sequence of transform invocations in the Python API (the 4xslt command can perform chaining for the user). This section shows how.

In the following example the next transform in the chain is set from within XSLT.

# The first transform: just reproduces all para elements within a wrapper
TRANSFORM = """
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:f="http://xmlns.4suite.org/ext"
  extension-element-prefixes="f"
>

<!-- Top level param so that user can pass in the next transform in the
     chain.  By default, use the identity transform -->
<xsl:param name="next-xslt"/>

<!-- grab all the para elements for the output -->
<xsl:template match="/">
  <parawrapper>
    <xsl:apply-templates select="//para"/>
  </parawrapper>
  <!-- Set the next transform in the chain.  You can also set to a
       hard-coded string -->
  <!-- notice that this is within a template, for instantiation -->
  <f:chain-to href="{$next-xslt}"/>
</xsl:template>

<xsl:template match="para">
  <xsl:copy-of select="."/>
</xsl:template>

</xsl:stylesheet>"""

DOC = """<doc>a<para>1</para>b<para>2</para>c</doc>"""

from Ft.Xml.Xslt import Processor
from Ft.Xml import InputSource

transform = InputSource.DefaultFactory.fromString(TRANSFORM, "urn:x-bogus:main.xslt")

IDT = u'http://cvs.4suite.org/viewcvs/*checkout*/4Suite/Ft/Data/identity.xslt'

processor = Processor.Processor()
processor.appendStylesheet(transform)
source = InputSource.DefaultFactory.fromString(DOC, "urn:x-bogus:doc.xml")
result = processor.run(source, topLevelParams={(None, 'next-xslt'): IDT})
print result

# processor.chainTo is the fully-resolved absolute URI of the next transform,
# or None if there was no f:chain-to element instantiated in the transform that
# the processor last processed.
next = processor.chainTo

processor = Processor.Processor()
processor.appendStylesheet(InputSource.DefaultFactory.fromUri(next))
# The output of the first transform becomes the source of the next one
source = InputSource.DefaultFactory.fromString(result, "urn:x-bogus:doc.xml")
result = processor.run(source)
print result

next = processor.chainTo                      # Should now be None
print "chainTo:", processor.chainTo

Note: There is not yet an API for automating the transform chain loop above. Ideas were discussed and an experiment was conducted here. If you have ideas for a good API, please submit them to the mailing list.
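
In the meantime, the loop is simple to write by hand. The following is a minimal sketch of one possible driver, not a 4Suite API: run_chain and its apply_transform argument are hypothetical. The callable is assumed to run one transform and return a pair of the serialized result and the URI of the next transform (or None), which in practice would wrap a Processor and its chainTo attribute. A hop limit guards against stylesheets that chain to each other forever.

```python
def run_chain(apply_transform, first_stylesheet, source_text, max_hops=10):
    """
    Apply transforms in sequence until none requests a successor.

    apply_transform(stylesheet_uri, text) must return (result, next_uri),
    where next_uri is None when the chain ends.
    """
    stylesheet = first_stylesheet
    text = source_text
    hops = 0
    while stylesheet is not None:
        if hops >= max_hops:
            raise RuntimeError("transform chain exceeded %d hops" % max_hops)
        # The output of each step is the input of the next
        text, stylesheet = apply_transform(stylesheet, text)
        hops += 1
    return text


# A fake transform step, used only to exercise the loop
NEXT = {"main.xslt": "identity.xslt", "identity.xslt": None}

def fake_transform(stylesheet_uri, text):
    return text + "|" + stylesheet_uri, NEXT[stylesheet_uri]

final = run_chain(fake_transform, "main.xslt", "doc")
```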

6.9 XSLT patterns

XSLT defines a pattern language based on XPath which is used to declare rules for matching patterns in the XML source against which to fire XSLT templates. The pattern implementation that 4Suite's XSLT library uses is also exposed as a library of its own. XSLT patterns are useful when your task is not so much to compute arbitrary information from a given node but, rather, to choose quickly from a collection of nodes the ones that meet some basic rules. This might seem a subtle difference. The following example might help illustrate it.

  • XPath task: extract the class attribute from all the child elements of the context node

  • XSLT pattern task: given a list of nodes, sort them into piles of those that have a class attribute and those that have a title child

The main API for pattern processing in 4Suite is Ft.Xml.Xslt.PatternList. The following is a code snippet that takes a node and returns a list of patterns it matches.

from Ft.Xml.Xslt import PatternList
from Ft.Xml.Domlette import NonvalidatingReader

# The first pattern matches elements with a class attribute;
# the second matches elements with a title child
PATTERNS = ["*[@class]", "*[title]"]

# Second parameter is a dictionary of prefix to namespace mappings
plist = PatternList(PATTERNS, {})

DOC = """
<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>"""

doc = NonvalidatingReader.parseString(DOC, "file:foo.xml")
for node in doc.documentElement.childNodes:
    # Don't forget that the white space text nodes before and after
    # e1, e2 and e3 elements are also child nodes of the spam element
    if node.nodeName[0] == "e":
        print plist.lookup(node)

The PatternList initializer takes the list of strings, which it conveniently converts to a list of compiled pattern objects. Such objects have a match method that returns a boolean value, but these methods are not used directly in this example. The PatternList initializer also takes a dictionary that makes up the namespace mapping. In this example, we use no namespaces, so the dictionary is empty. The lookup method is applied to a selection of the children of the spam element (all the nodes whose name starts with "e", which happens to be all the element nodes). The output of the example follows:

[*[attribute::class]]
[*[child::title]]
[*[attribute::class], *[child::title]]

The output is a list of the representations of the pattern objects that matched each node. Notice how the axis abbreviations have been expanded in the pattern object representation.
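
The "sorting into piles" idea can also be illustrated without the pattern engine. The sketch below does not use 4Suite: it pairs each pattern string with a hand-written Python predicate that approximates it, and uses the standard library's ElementTree to walk the same document. The names RULES and matching_rules are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """
<spam>
  <e1 class="1"/>
  <e2><title>A</title></e2>
  <e3 class="2"><title>B</title></e3>
</spam>"""

# Each entry pairs a pattern string with a plain Python predicate that
# approximates it
RULES = [
    ("*[@class]", lambda elem: "class" in elem.attrib),
    ("*[title]", lambda elem: elem.find("title") is not None),
]


def matching_rules(elem):
    """Return the pattern strings whose predicate accepts the element."""
    return [pattern for pattern, test in RULES if test(elem)]


root = ET.fromstring(DOC)
# ElementTree children are elements only, so no whitespace filtering is needed
piles = dict((child.tag, matching_rules(child)) for child in root)
```

A real pattern list has the advantage that the rules stay declarative strings, so they can be supplied at run time instead of being coded as predicates.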

7 XPath and XSLT extensions

Sometimes the built-in facilities of XPath and XSLT aren't quite enough to meet your processing needs. Luckily it's easy to extend the functionality of these libraries using user extension functions and elements, which are written in Python.

7.1 Extension functions (XPath and XSLT)

To define your own extension functions for XPath and XSLT, you write the corresponding Python functions in a module, and provide a mapping from the desired XPath function names to Python function objects (or any callables). Let's start with a simple example. The following is a complete module which defines a single XPath function, ord(s), which takes a string and returns the Unicode code point number of the first character in that string.

#ord.py
from Ft.Xml.XPath import Conversions

def Ord(context, s):
    '''
    Available in XPath as ord() as defined by ExtFunctions mapping below
    Takes an object, which is coerced to string
    Returns the Unicode code point number for the first character in that string
    Or returns -1 if it's an empty string
    '''
    s = Conversions.StringValue(s)  # Coerce the passed object to string
    if s:
        return ord(s[0])
    else:
        return -1

ExtFunctions = {
    (u'urn:x-4suite:x', u'ord'): Ord,
}

As this simple example illustrates, the basic way to map XPath function names to Python function objects is with a dictionary named "ExtFunctions", global to the module in which the extension function is defined. The XPath/XSLT extension names are expressed as a Python tuple of two Unicode objects. If you're familiar with XPath, this is just a Python representation of an expanded name. The first item in the expanded name tuple is the namespace URI for the function, and the second is the local name. The namespace URI cannot be an empty string.

You have to actually tell the processor to load your extension modules. There are several ways to do so.

  1. From Python code you can register them in a context object used for XPath processing by using the optional extModuleList to pass in a list of module objects.

  2. You can also register particular functions rather than a complete module in an XPath context object using the optional extFunctionMap argument. It takes a mapping dictionary similar to the ExtFunctions dictionary shown in the above sample module.

  3. If you are using the XSLT processor you can register extension functions on a processor object using the registerExtensionModules() method.

  4. When using the XSLT processor you can also register individual extension functions on a processor object using the registerExtensionFunction() method. It takes the namespace and localName for the extension function and the callable object that implements it.

  5. In some cases the user can list extension modules using the environment variable "EXTMODULES". "EXTMODULES" is a colon-separated list of Python module names. This works for the 4xslt command line and for Ft.Xml.XPath.Evaluate. For other APIs, use one of the other methods, which can easily be extended to read the "EXTMODULES" variable. In general the other methods for registering extensions are preferable.

Note that extension modules will automatically be searched for XSLT extension elements as well as functions.
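
Extending one of the programmatic methods to honor "EXTMODULES" mostly comes down to splitting the variable. The following is a minimal sketch under that assumption; the function name ext_module_names is hypothetical, and the resulting list is the kind of value that would be handed to registerExtensionModules().

```python
import os


def ext_module_names(environ=None):
    """Return the module names listed in EXTMODULES (colon-separated)."""
    if environ is None:
        environ = os.environ
    raw = environ.get("EXTMODULES", "")
    # Skip empty entries such as those left by a trailing colon
    return [name for name in raw.split(":") if name]


names = ext_module_names({"EXTMODULES": "demo:ord:"})
```

Taking the environment as an argument keeps the function easy to test without touching the real process environment.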

The following is a longer example, a module that implements two functions. One returns the current time and the other creates a hash of the context node name:

# demo.py
import time
from Ft.Xml.XPath import Conversions

def GetCurrentTime(context):
    '''available in XPath as get-current-time()'''
    return time.asctime(time.localtime())

def HashContextName(context, maxkey):
    '''
    available in XPath as hash-context-name(maxkey),
    where maxkey is an object converted to number
    '''
    # It is a good idea to use the appropriate core function to coerce
    # arguments to the expected type
    maxkey = Conversions.NumberValue(maxkey)
    # Sum the code points of the characters in the node name, then
    # reduce that sum modulo maxkey
    key = sum([ord(c) for c in context.node.nodeName])
    return key % maxkey

ExtFunctions = {
    ('urn:x-4suite:x', 'get-current-time'): GetCurrentTime,
    ('urn:x-4suite:x', 'hash-context-name'): HashContextName
}

You can use this in plain XPath as follows:

import demo  # the extension module shown above
from Ft.Xml.XPath.Context import Context
from Ft.Xml.XPath import Compile, Evaluate
from Ft.Xml.Domlette import NonvalidatingReader

DOC = "<spam xmlns='http://spam.com'>eggs</spam>"

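# Bind a prefix to the namespace of the extension functions so that
# the expression can refer to them
ctx = Context(None, extFunctionMap=demo.ExtFunctions,
              processorNss={"x": "urn:x-4suite:x"})
expr = Compile("x:get-current-time()")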

doc = NonvalidatingReader.parseString(DOC, "http://spam.com/base")
print Evaluate(expr, doc, ctx)

Notice that you might choose to use None as the namespace of your extension functions. If so, you don't need to specify the processorNss context attribute, but watch out for clashes with other extension function names, including those of the built-in library. Again, if you plan to use an extension function from within XSLT, its namespace URI must not be None.

You can use this in XSLT just as easily:

# useextfunc.py

TRANSFORM = """<?xml version="1.0"?>
<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:s="urn:x-4suite:x"
  version="1.0">

  <xsl:template match="/">
    <xsl:value-of select="s:get-current-time()"/>
  </xsl:template>

</xsl:stylesheet>
"""

SOURCE = """<dummy/>"""

from Ft.Xml.Xslt import Processor
processor = Processor.Processor()

# Register the extension function using method (3)
processor.registerExtensionModules(['demo'])
from Ft.Xml import InputSource
transform = InputSource.DefaultFactory.fromString(TRANSFORM, "http://foo.com")
source = InputSource.DefaultFactory.fromString(SOURCE, "http://foo.com")
processor.appendStylesheet(transform)
result = processor.run(source)
print result

For good examples of modules with extension functions, see the source code for the modules Ft.Xml.XPath.BuiltInExtFunctions, Ft.Xml.Xslt.BuiltInExtFunctions and the modules in Ft.Xml.Xslt.Exslt. The latter are especially good examples given their diversity and detailed specifications at exslt.org.

7.2 Extension elements (XSLT)

To define your own extension elements, define a class derived from Ft.Xml.Xslt.XsltElement. The module in which it is defined should have a global dictionary named "ExtElements" mapping element expanded names to element class objects.

Finally, modules containing any extension elements used must be indicated as such to the processor in one of several ways.

  1. You can register all extension functions and elements in a module by using a processor object's registerExtensionModules() method.

  2. You can also register individual extension elements on a processor object using the registerExtensionElement() method. It takes the namespace and localName for the extension element and the class that implements it.

  3. In some cases the user can list extension modules using the environment variable "EXTMODULES". "EXTMODULES" is a colon-separated list of Python module names. This works for the 4xslt command line and for Ft.Xml.XPath.Evaluate. For other APIs, use one of the other two methods, which can easily be extended to read the "EXTMODULES" variable. In general the other methods for registering extensions are preferable.

Note that extension modules will automatically be searched for XPath extension functions as well as extension elements.

7.3 Extension element API

There are several aspects of the extension element API worth discussing in more detail.

The class-level "content" variable specifies a content model to be enforced by the XSLT processor. If the element is used in a way that doesn't meet the specified content model, the user will get an error message. The content model is a structure that uses certain special classes, including:

  • ContentInfo.Empty - matches no content at all (empty element)

  • ContentInfo.Text - matches plain text content

  • ContentInfo.Seq - matches the given sequence of sub-patterns

  • ContentInfo.Alt - matches one of the given choice of sub-patterns

  • ContentInfo.Rep - matches 0 or more repeated instances of the given sub-pattern

  • ContentInfo.Rep1 - matches 1 or more repeated instances of the given sub-pattern

  • ContentInfo.Opt - matches zero or one of the given sub-pattern

  • ContentInfo.ResultElements - matches elements not in the XSL namespace

  • ContentInfo.Instructions - matches any sequence of XSLT elements categorized as instructions in the spec

  • ContentInfo.Template - matches an XSLT template body according to the spec

  • ContentInfo.TopLevelElements - matches any sequence of XSLT elements categorized as top level in the spec

  • ContentInfo.QName - matches a particular element by giving its namespace and node name (the prefix in the node name is only used for documentation and error messages)

So, for instance, the xsl:choose element would be described as

content = ContentInfo.Seq(
    ContentInfo.Rep1(ContentInfo.QName(XSL_NAMESPACE, 'xsl:when')),
    ContentInfo.Opt(ContentInfo.QName(XSL_NAMESPACE, 'xsl:otherwise')),
    )
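
The meaning of this content model can be made concrete with a small stand-in that does not use 4Suite. The sketch below expresses "one or more xsl:when, then an optional xsl:otherwise" as a regular expression over the sequence of child element names; the function name is hypothetical, and this is only an illustration of the semantics, not how ContentInfo is implemented.

```python
import re

# Seq(Rep1(xsl:when), Opt(xsl:otherwise)) written as a regular expression
# over child element names, each wrapped in angle brackets
_CHOOSE_CONTENT = re.compile(r"^(<xsl:when>)+(<xsl:otherwise>)?$")


def valid_choose_content(child_names):
    """Check a list of child element names against the xsl:choose model."""
    encoded = "".join("<%s>" % name for name in child_names)
    return _CHOOSE_CONTENT.match(encoded) is not None


ok = valid_choose_content(["xsl:when", "xsl:when", "xsl:otherwise"])
empty = valid_choose_content([])
wrong_order = valid_choose_content(["xsl:otherwise", "xsl:when"])
```

Here an empty xsl:choose is rejected because Rep1 requires at least one xsl:when, and xsl:otherwise is accepted only in last position because Seq fixes the order.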

The class-level "legalAttrs" variable specifies the attributes allowed or required on the element. It is a Python dictionary mapping each attribute name to its specification. The specification is an instance of a class chosen according to the type of attribute.

The following are the supported attribute classes. The parameters specified are for the initializer. Note that most general patterns have a plain variant and an attribute value template (AVT) variant:

  • AttributeInfo.String - any XPath string

  • AttributeInfo.StringAvt - an AVT yielding any string

  • AttributeInfo.Char - any XPath string of length 1

  • AttributeInfo.CharAvt - AVT version of Char

  • AttributeInfo.Choice - a string which must be one of a number of given values. The values are given by a list of strings, which is the first parameter

  • AttributeInfo.ChoiceAvt - AVT version of Choice

  • AttributeInfo.YesNo - Abbreviation for an AttributeInfo.Choice restricted to the values "yes" and "no"

  • AttributeInfo.YesNoAvt - AVT version of YesNo

  • AttributeInfo.Number - any XPath number

  • AttributeInfo.NumberAvt - AVT version of Number

  • AttributeInfo.UriReference - XPath string that is syntactically a URI reference

  • AttributeInfo.UriReferenceAvt - AVT version of UriReference

  • AttributeInfo.Id - XPath string that is syntactically an XML ID

  • AttributeInfo.IdAvt - AVT version of Id

  • AttributeInfo.QName - XPath string that is syntactically an XML namespaces qualified name

  • AttributeInfo.QNameAvt - AVT version of QName

  • AttributeInfo.NCName - XPath string that is syntactically an XML namespaces "no colon" name

  • AttributeInfo.NCNameAvt - AVT version of NCName

  • AttributeInfo.Prefix - Same as NCName

  • AttributeInfo.PrefixAvt - Same as NCNameAvt

  • AttributeInfo.NMToken - XPath string that is syntactically an XML Name token

  • AttributeInfo.NMTokenAvt - AVT version of NMToken

  • AttributeInfo.QNameButNotNCName - A QName that contains a colon

  • AttributeInfo.QNameButNotNCNameAvt - AVT version of QNameButNotNCName

  • AttributeInfo.Token - XPath string that is syntactically an XPath name test (i.e. "foo", "ns:foo", "ns:*" or "*")

  • AttributeInfo.TokenAvt - AVT version of Token

  • AttributeInfo.Expression - XPath string that is syntactically an XPath expression

  • AttributeInfo.ExpressionAvt - AVT version of Expression

  • AttributeInfo.StringExpression - XPath string that is syntactically an XPath expression, which would be expected to return a string value

  • AttributeInfo.StringExpressionAvt - AVT version of StringExpression

  • AttributeInfo.NodeSetExpression - XPath string that is syntactically an XPath expression, which would be expected to return a node set value

  • AttributeInfo.NodeSetExpressionAvt - AVT version of NodeSetExpression

  • AttributeInfo.NumberExpression - XPath string that is syntactically an XPath expression, which would be expected to return a number value

  • AttributeInfo.NumberExpressionAvt - AVT version of NumberExpression

  • AttributeInfo.BooleanExpression - XPath string that is syntactically an XPath expression, which would be expected to return a boolean value

  • AttributeInfo.BooleanExpressionAvt - AVT version of BooleanExpression

  • AttributeInfo.Pattern - XPath string that is syntactically an XSLT pattern

  • AttributeInfo.PatternAvt - AVT version of Pattern

  • AttributeInfo.Tokens - XPath string that is syntactically a space-delimited series of tokens

  • AttributeInfo.TokensAvt - AVT version of Tokens

  • AttributeInfo.QNames - XPath string that is syntactically a space-delimited series of QNames

  • AttributeInfo.QNamesAvt - AVT version of QNames

  • AttributeInfo.Prefixes - XPath string that is syntactically a space-delimited series of NCNames

  • AttributeInfo.PrefixesAvt - AVT version of Prefixes

All of these classes take the following optional keyword parameters:

  • description - for documentation

  • default - the default value of the attribute to be used if omitted
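
How a specification with a list of values and a default behaves can be sketched in a few lines. The classes and functions below are hypothetical stand-ins written for illustration; they mimic the idea of AttributeInfo.Choice but are not the 4Suite classes and do not share their API.

```python
class ChoiceSpec(object):
    """Toy attribute specification: one of a fixed set of values."""

    def __init__(self, values, default=None, description=''):
        self.values = list(values)
        self.default = default
        self.description = description

    def check(self, value):
        # An omitted attribute (None) falls back to the default
        if value is None:
            return self.default
        if value not in self.values:
            raise ValueError("%r is not one of %r" % (value, self.values))
        return value


legal_attrs = {
    'order': ChoiceSpec(['ascending', 'descending'], default='ascending',
                        description='sort direction'),
}


def check_attributes(legal, supplied):
    """Validate supplied attributes and fill in defaults."""
    unknown = set(supplied) - set(legal)
    if unknown:
        raise ValueError("illegal attribute(s): %s" % ", ".join(sorted(unknown)))
    return dict((name, spec.check(supplied.get(name)))
                for name, spec in legal.items())


defaults_applied = check_attributes(legal_attrs, {})
explicit = check_attributes(legal_attrs, {'order': 'descending'})
```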

Some examples from the XSLT spec:

xsl:output

content = ContentInfo.Empty
legalAttrs = {
    'method' : AttributeInfo.QName(),
    'version' : AttributeInfo.NMToken(),
    'encoding' : AttributeInfo.String(),
    'omit-xml-declaration' : AttributeInfo.YesNo(),
    'standalone' : AttributeInfo.YesNo(),
    'doctype-public' : AttributeInfo.String(),
    'doctype-system' : AttributeInfo.String(),
    'cdata-section-elements' : AttributeInfo.QNames(),
    'indent' : AttributeInfo.YesNo(),
    'media-type' : AttributeInfo.String(),
    }

xsl:sort

content = ContentInfo.Empty
legalAttrs = {
    'select' : AttributeInfo.StringExpression(default='.'),
    'lang' : AttributeInfo.NMTokenAvt(),
    # We don't support any additional data-types, hence no
    # AttributeInfo.QNameButNotNCName()
    'data-type' : AttributeInfo.ChoiceAvt(['text', 'number'],
                                          default='text'),
    'order' : AttributeInfo.ChoiceAvt(['ascending', 'descending'],
                                      default='ascending'),
    'case-order' : AttributeInfo.ChoiceAvt(['upper-first', 'lower-first']),
    }

xsl:number

content = ContentInfo.Empty
legalAttrs = {
    'level' : AttributeInfo.Choice(['single', 'multiple', 'any'],
                                   default='single'),
    'count' : AttributeInfo.Pattern(),
    'from' : AttributeInfo.Pattern(),
    'value' : AttributeInfo.Expression(),
    'format' : AttributeInfo.StringAvt(default='1'),
    'lang' : AttributeInfo.NMToken(),
    'letter-value' : AttributeInfo.ChoiceAvt(['alphabetic', 'traditional']),
    'grouping-separator' : AttributeInfo.CharAvt(),
    'grouping-size' : AttributeInfo.NumberAvt(default=0),
    }

Of course, it's always a good idea to use descriptions, which the above do not.

For good examples of modules with extension elements, see the source code for the modules Ft.Xml.Xslt.BuiltInExtElements and Ft.Xml.Xslt.Exslt.Common. The modules in Ft.Xml.Xslt.Exslt cover a wide range of functionality and make good examples, especially given their detailed specifications at exslt.org.

7.3.1 Controlling output from XSLT extensions

The most common special need for XSLT extensions is to generate XSLT output. For extension elements this is easy enough to do using the API on the processor instance that is passed to the instantiate() method of extension element classes. For example:

class SpamElement(XsltElement):
    legalAttrs = {}
    def instantiate(self, context, processor):
        processor.output().startElement('title')
        processor.output().text('Life of Brian')
        processor.output().endElement('title')
        return (context,)

Extension functions are not passed a processor instance directly, but context objects hold a reference to the processor in effect, so the following example works:

def Spam(context):
    context.processor.output().startElement('title')
    context.processor.output().text('Life of Brian')
    context.processor.output().endElement('title')
    return

However, it is probably better design to reserve such side effects as output for extension elements rather than functions.

In the above examples, the elements and text that are output just use the current output parameters. In order to change output parameters or change the output stream, you can stack a new output handler:

import cStringIO
from Ft.Xml import EMPTY_NAMESPACE

stream = cStringIO.StringIO()

# Clone the current output parameters
op = processor.writers[-1]._outputParams.clone()

# Force the XML output method and omit the XML declaration.
# The output method is a qualified name, so the null namespace must be
# given explicitly to select the standard xml method
op.method = (EMPTY_NAMESPACE, 'xml')
op.omitXmlDeclaration = "yes"

# Push the new handler to the top of the writer stack
processor.addHandler(op, stream)
processor.output().startElement('title')
processor.output().text('Life of Brian')
processor.output().endElement('title')

# Pop back to the previous handler; stream.getvalue()
# now contains the new output
processor.removeHandler()
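
The stacking behaviour can be pictured with a small stand-alone model. The sketch below is not 4Suite code: WriterStack and its push, pop and write methods are hypothetical names, and it is written for Python 3 using only the standard library. It only illustrates that output goes to the most recently pushed stream and that popping restores the previous one.

```python
import io


class WriterStack:
    """Hypothetical model of a stack of output streams (not the 4Suite API)."""

    def __init__(self, stream):
        self._streams = [stream]

    def push(self, stream):
        # New output is redirected to this stream until pop() is called
        self._streams.append(stream)

    def pop(self):
        # Restore the previous stream and hand back the one just removed
        return self._streams.pop()

    def write(self, text):
        # Always write to the stream on top of the stack
        self._streams[-1].write(text)


main = io.StringIO()
stack = WriterStack(main)
stack.write("<doc>")

# Temporarily capture output in a separate stream
captured = io.StringIO()
stack.push(captured)
stack.write("<title>Life of Brian</title>")
stack.pop()

# Output written after the pop goes to the original stream again
stack.write("</doc>")
```

After this runs, captured holds only the title element, while main holds the surrounding document markup.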

7.3.2 Creating result tree fragments

Another common need is to treat the body of an extension element as a template so that something can be done with the RTF that results from it. The following example demonstrates this:

try:
    # Set the output to an RTF writer, which will create an RTF for us
    processor.pushResultTree(self.baseUri)

    # The template is manifested as children of the extension element
    # node.  Instantiate each in turn
    for child in self.children:
        child.instantiate(context, processor)
# You want to be sure you re-balance the stack even in case of error
finally:
    # Retrieve the resulting RTF
    result_rtf = processor.popResult()

7.3.3 Communicating with the external code that invokes XSLT

You can set and communicate state information with external code by using the processor.extensionParams attribute. For example, the following sets a time stamp recording when the extension was instantiated, which can later be retrieved from the processor after the XSLT process completes, or even by later extensions. In a similar way, state can be set up by the calling code and retrieved by extensions.

# Extension parameters have fully qualified names, so you must come up
# with a namespace to set them
processor.extensionParams[(SPAM_NAMESPACE, 'tstamp')] = time.time()
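
The value of fully qualified names can be seen in a small model. The sketch below is written for Python 3 and uses a plain dictionary as a stand-in for processor.extensionParams; the two namespace URIs are hypothetical and chosen only for illustration.

```python
import time

# Hypothetical namespaces, used only for this illustration
SPAM_NAMESPACE = "http://example.com/spam"
EGGS_NAMESPACE = "http://example.com/eggs"

# Stand-in for processor.extensionParams: a plain dictionary
extension_params = {}

# Two extensions can both use the local name 'tstamp' without clashing,
# because the namespace is part of the key
extension_params[(SPAM_NAMESPACE, "tstamp")] = time.time()
extension_params[(EGGS_NAMESPACE, "tstamp")] = "not a time stamp"

# The calling code reads a value back with the same qualified name
spam_tstamp = extension_params[(SPAM_NAMESPACE, "tstamp")]
```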

8 Streaming XML output

MarkupWriter is a streaming API for generating XML. The Ft.Xml.MarkupWriter class is specialized for creating XML documents from scratch. Documents written with MarkupWriter are written to the output (standard output or another file-like object) as you build them, so if you need to process the document in memory, you may need another tool such as a DOM-like tool (e.g. Domlette, Amara, etc).

4Suite partitions XML serializers into two categories: writers and printers.

  • A writer is a module that exposes a broad public API for building output incrementally.

  • A printer is a module that simply takes a DOM and creates output from it as a whole, within one API invocation.

MarkupWriter is the primary example of this writer category of XML serializers.
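
The distinction between the two categories is not specific to 4Suite. As a rough illustration, the Python 3 sketch below uses two standard library tools in place of 4Suite's own classes: xml.sax.saxutils.XMLGenerator plays the role of a writer, and xml.etree.ElementTree.tostring plays the role of a printer.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import XMLGenerator

# Writer style: each call emits markup immediately
stream = io.StringIO()
gen = XMLGenerator(stream, encoding="utf-8")
gen.startElement("vendor", {})
after_start = stream.getvalue()   # the start tag is already in the stream
gen.startElement("name", {})
gen.characters("Centigrade systems")
gen.endElement("name")
gen.endElement("vendor")
written = stream.getvalue()

# Printer style: build the whole tree first, then serialize it in one call
vendor = ET.Element("vendor")
ET.SubElement(vendor, "name").text = "Centigrade systems"
printed = ET.tostring(vendor, encoding="unicode")
```

Both approaches produce the same markup here. The difference is that the writer never holds the whole document in memory, while the printer needs the complete tree before it can produce any output.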

The following example uses this class for generating a simple XML Software Autoupdate (XSA) file. XSA is an XML data format for listing and describing software packages.

from Ft.Xml import MarkupWriter

# Set the output doc type details (required by XSA)
SYSID = u"http://www.garshol.priv.no/download/xsa/xsa.dtd"
PUBID = u"-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML"
writer = MarkupWriter(indent=u"yes", doctypeSystem=SYSID,
                      doctypePublic=PUBID)
writer.startDocument()
writer.startElement(u'xsa')
writer.startElement(u'vendor')

# Element with simple text (#PCDATA) content
writer.simpleElement(u'name', content=u'Centigrade systems')
writer.simpleElement(u'email', content=u"info@centigrade.bogus")
writer.endElement(u'vendor')

# Element with an attribute
writer.startElement(u'product', attributes={u'id': u"100\u00B0"})
writer.simpleElement(u'name', content=u"100\u00B0 Server")
writer.simpleElement(u'version', content=u"1.0")
writer.simpleElement(u'last-release', content=u"20030401")

# Empty element
writer.simpleElement(u'changes')
writer.endElement(u'product')
writer.endElement(u'xsa')
writer.endDocument()

This is the output we get from the code above:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE xsa PUBLIC "-//LM Garshol//DTD XML Software Autoupdate 1.0//EN//XML" "http://www.garshol.priv.no/download/xsa/xsa.dtd">
<xsa>
  <vendor>
    <name>Centigrade systems</name>
    <email>info@centigrade.bogus</email>
  </vendor>
  <product id="100°">
    <name>100° Server</name>
    <version>1.0</version>
    <last-release>20030401</last-release>
    <changes/>
  </product>
</xsa>

The above example illustrates some of the basics of using the MarkupWriter class. The following sections describe both the essential and the advanced features of this class. There is often more than one way to output a given document section.

8.1 Starting with MarkupWriter

After importing the MarkupWriter class, create a MarkupWriter instance and then start the new document. (See below for the output options of MarkupWriter.) Remember that you are working with a streaming API: you must decide what features you want your output to have before you start to write it.

>>> from Ft.Xml import MarkupWriter
>>> writer = MarkupWriter()
>>> writer.startDocument() 

You are now ready to add data to the new document.

Important

Make sure that all of your data (element names, attributes, content, etc) are Python unicode objects.

8.2 How to insert elements

There are two ways to add new elements as children of other document or element nodes.

  1. When you want to add a new element that will itself have child elements, you can use the startElement/endElement method combination to signal the beginning and the ending of an element, respectively.

    writer.startElement(u'xsa')
    # other document content can be output here
    writer.endElement(u'xsa')

  2. Alternatively, you can use the simpleElement method, which is a shortcut for the startElement/endElement combination and produces an element with no content or with text content (if you specify the content parameter).

    writer.simpleElement(u'xsa')

8.3 How to insert attributes

There are two ways to add attributes to elements:

  1. First, you can use the attributes parameter of the startElement method. This parameter is a dictionary which maps each attribute name to the value of that attribute. If an attribute's name is in a namespace, then you must specify the name as a Python tuple, with the attribute's QName as the first member of the tuple, and the namespace URI as the second member of the tuple. For an example of this advanced syntax, see “Writing XHTML with MarkupWriter”.

    writer.startElement(u'product', attributes={u'id': u"100\u00B0"})

  2. Alternatively, you can use a distinct attribute method with two parameters: the attribute's name and the attribute's value. As with the dictionary approach above, if the attribute's name is in a namespace, then the whole name should be a Python tuple.

    writer.startElement(u'product')
    writer.attribute(u'id', u"100\u00B0")

8.4 How to insert text nodes

Similarly, there are two ways to add text nodes to elements.

  1. First, the simpleElement method takes a content parameter, which can be used to create a single text node child of the node with the specified name.

    writer.simpleElement(u'name', content=u'Centigrade systems')

  2. Alternatively, instances of the MarkupWriter class, such as writer, have a text method that inserts a single text node as the next child of the element which was last started with the startElement method and which has not yet been closed with the endElement method.

    writer.startElement(u'product')
    writer.text(u'Centigrade systems')
    writer.endElement(u'product')
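
To make the relationship between these methods concrete, here is a minimal model of a streaming writer. It is a sketch for Python 3 using only the standard library, not 4Suite's implementation: the TinyWriter class is hypothetical, it ignores namespaces, and it borrows the method names from this section only to show why attribute calls must directly follow startElement and how simpleElement can be built from the other methods.

```python
import io
from xml.sax.saxutils import escape, quoteattr


class TinyWriter:
    """Hypothetical streaming writer; a teaching model, not 4Suite's code."""

    def __init__(self, stream):
        self._stream = stream
        self._open = []          # names of elements not yet ended
        self._in_tag = False     # True while a start tag can still take attributes

    def _close_tag(self):
        # Finish a pending start tag before any content is written
        if self._in_tag:
            self._stream.write(">")
            self._in_tag = False

    def startElement(self, name, attributes=None):
        self._close_tag()
        self._stream.write("<" + name)
        self._open.append(name)
        self._in_tag = True
        for attr_name, value in (attributes or {}).items():
            self.attribute(attr_name, value)

    def attribute(self, name, value):
        # Only legal directly after startElement, before any content
        if not self._in_tag:
            raise ValueError("attribute() must follow startElement()")
        self._stream.write(" %s=%s" % (name, quoteattr(value)))

    def text(self, data):
        self._close_tag()
        self._stream.write(escape(data))

    def endElement(self, name):
        if not self._open or self._open[-1] != name:
            raise ValueError("endElement(%r) does not match the open element" % name)
        self._open.pop()
        if self._in_tag:
            # No content was written, so emit an empty-element tag
            self._stream.write("/>")
            self._in_tag = False
        else:
            self._stream.write("</%s>" % name)

    def simpleElement(self, name, attributes=None, content=None):
        # Shortcut for startElement + optional text + endElement
        self.startElement(name, attributes)
        if content is not None:
            self.text(content)
        self.endElement(name)


out = io.StringIO()
w = TinyWriter(out)
w.startElement("product")
w.attribute("id", "100\u00b0")
w.simpleElement("name", content="Centigrade systems")
w.simpleElement("changes")
w.endElement("product")
result = out.getvalue()
```

Because the start tag is only closed when content arrives, an attribute can be added after startElement but not after text or a child element has been written.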

8.5 How to insert a complete chunk

MarkupWriter also allows you to insert well-formed XML entities as complete chunks in the output. This is a very convenient way to emit boilerplate XML without breaking it down into all the separate element/attribute/content bits. As such the lines:

writer.simpleElement(u'name', content=u"100\u00B0 Server")
writer.simpleElement(u'version', content=u"1.0")
writer.simpleElement(u'last-release', content=u"20030401")

Could instead be written:

writer.xmlFragment("""
<name>100° Server</name>
<version>1.0</version>
<last-release>20030401</last-release>""")

Important

The parameter of xmlFragment is a string, not a unicode object.
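
Since the fragment is expected to be well-formed, it can be useful to check boilerplate before emitting it. The following Python 3 sketch is one rough way to do that with the standard library; is_well_formed_fragment is a hypothetical helper, not part of 4Suite, and wrapping the text in a dummy root element is a simplification that does not cover every kind of entity.

```python
import xml.etree.ElementTree as ET


def is_well_formed_fragment(fragment):
    """Return True if the fragment parses once wrapped in a dummy root."""
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError:
        return False
    return True


good = "<name>100\u00b0 Server</name><version>1.0</version>"
bad = "<name>100\u00b0 Server<version>1.0</version>"   # <name> is never closed
```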

8.6 How to insert processing instructions and comments

The API provides the comment and processingInstruction methods for inserting processing instructions and comments. The comment method takes a unicode string, which is the intended value of the comment. The processingInstruction method takes two unicode strings. The first is the name of the processing instruction, and the second is the value of the processing instruction. For example, the following code:

writer.comment(u"This is a comment")
writer.processingInstruction(u'xml-stylesheet', u'type="text/xsl" href="akara.xsl"')

produces the following output:

<!--This is a comment-->
<?xml-stylesheet type="text/xsl" href="akara.xsl"?>
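
For comparison, the standard library's ElementTree draws the same distinction between a comment, which has a single value, and a processing instruction, which has a name (target) and a value. This Python 3 sketch is not 4Suite code; it only shows how the two pieces of a processing instruction end up in the serialized form.

```python
import xml.etree.ElementTree as ET

# A comment carries one string
comment = ET.Comment("This is a comment")

# A processing instruction carries a target and its data
pi = ET.ProcessingInstruction("xml-stylesheet",
                              'type="text/xsl" href="akara.xsl"')

comment_markup = ET.tostring(comment, encoding="unicode")
pi_markup = ET.tostring(pi, encoding="unicode")
```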

8.7 Using namespaces

When you create a new element or attribute, you can place it in a namespace by passing the namespace URI along with the qualified name, as in the following program:

from Ft.Xml import MarkupWriter

writer = MarkupWriter(indent=u'yes')
writer.startDocument()

RDFNS = u"http://www.w3.org/1999/02/22-rdf-syntax-ns#"

writer.startElement(u"rdf:RDF", RDFNS)
writer.startElement(u"rdf:Description", RDFNS,
    attributes={(u'rdf:about', RDFNS): u'http://media.example.com/audio/guide.ra'})
writer.endElement(u'rdf:Description', RDFNS)
writer.endElement(u'rdf:RDF', RDFNS)
writer.endDocument()

And this is the output:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
    <rdf:Description rdf:about="http://media.example.com/audio/guide.ra"/>
</rdf:RDF>
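
The same pairing of a qualified name with a namespace URI appears in other streaming APIs. The Python 3 sketch below uses the standard library's xml.sax.saxutils.XMLGenerator rather than MarkupWriter, so the method names and argument order differ from 4Suite's; it is included only to show the general idea of declaring a prefix once and then naming elements and attributes by (namespace URI, local name).

```python
import io
from xml.sax.saxutils import XMLGenerator

RDFNS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

stream = io.StringIO()
gen = XMLGenerator(stream, encoding="utf-8", short_empty_elements=True)
gen.startDocument()

# Bind the rdf prefix; the declaration is written on the next start tag
gen.startPrefixMapping("rdf", RDFNS)
gen.startElementNS((RDFNS, "RDF"), "rdf:RDF", {})

# Attribute names are (namespace URI, local name) pairs as well
gen.startElementNS(
    (RDFNS, "Description"), "rdf:Description",
    {(RDFNS, "about"): "http://media.example.com/audio/guide.ra"})
gen.endElementNS((RDFNS, "Description"), "rdf:Description")

gen.endElementNS((RDFNS, "RDF"), "rdf:RDF")
gen.endPrefixMapping("rdf")
gen.endDocument()
markup = stream.getvalue()
```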

8.8 Setting up the output

In the above example, you can see how parameters that control the output are passed into the MarkupWriter initializer, including document type info and whether to indent (pretty print).

You can pass any of the usual controls for XSLT output into the initializer this way.

stream

By default MarkupWriter sends its output to sys.stdout, but you can substitute any file-like object by passing in an initializer parameter. This stream parameter should be the first argument to the MarkupWriter constructor. For example:

output_file = file('output.xml', 'w')
writer = MarkupWriter(output_file, indent=u"yes")

indent

The indent named parameter controls whether or not the output will have whitespace inserted to indent tags in the output. The default is "no".

doctypeSystem, doctypePublic

These two named parameters control the system and public identifiers that will be included in the output.

omitXmlDeclaration

Set this named parameter to "yes" to suppress output of the XML declaration. The default is "no".

encoding

This named parameter controls the character encoding to use. (The default is UTF-8.) The writer will automatically use character entities where necessary.

standalone

Set this named parameter to "yes" to set standalone in the XML declaration.

mediaType

This parameter sets the media type of the output. You will probably never need this.

cdataSectionElements

This named parameter is a list of element names whose output will be wrapped in a CDATA section. This can provide for friendlier output in some cases.

The XSLT spec also defines a method parameter to choose between XML, HTML or plain text output rules, but for MarkupWriter at the moment you should stick to XML. The result of changing the method is undefined. We'll probably relax this restriction in later releases.
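
The remark under the encoding parameter, that character entities are used where necessary, can be illustrated without 4Suite. The Python 3 sketch below uses only the built-in codec machinery: a character that the target encoding cannot represent is replaced by a numeric character reference, which is the same kind of substitution a serializer has to make.

```python
text = "100\u00b0 Server"

# UTF-8 can represent the degree sign directly (as two bytes)
as_utf8 = text.encode("utf-8")

# ASCII cannot, so the character becomes a numeric character reference
as_ascii = text.encode("ascii", "xmlcharrefreplace")
```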

8.9 More examples

8.9.1 Writing XHTML with MarkupWriter

Uche Ogbuji provides this example, which writes a simple XHTML file, in his blog:

from Ft.Xml.MarkupWriter import MarkupWriter
from xml.dom import XHTML_NAMESPACE, XML_NAMESPACE

XHTML_NS = unicode(XHTML_NAMESPACE)
XML_NS = unicode(XML_NAMESPACE)
XHTML11_SYSID = u"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
XHTML11_PUBID = u"-//W3C//DTD XHTML 1.1//EN"

writer = MarkupWriter(indent=u"yes", doctypeSystem=XHTML11_SYSID,
                      doctypePublic=XHTML11_PUBID)
writer.startDocument()
writer.startElement(u'html', XHTML_NS, attributes={(u'xml:lang', XML_NS): u'en'})
writer.startElement(u'head', XHTML_NS)
writer.simpleElement(u'title', XHTML_NS, content=u'Virtual Library')
writer.endElement(u'head', XHTML_NS)
writer.startElement(u'body', XHTML_NS)
writer.startElement(u'p', XHTML_NS)
writer.text(u'Moved to ')
writer.simpleElement(u'a', XHTML_NS,
                     attributes={u'href': u'http://vlib.org/'},
                     content=u'vlib.org')
writer.text(u'.')
writer.endElement(u'p', XHTML_NS)
writer.endElement(u'body', XHTML_NS)
writer.endElement(u'html', XHTML_NS)
writer.endDocument()

This example results in the following XHTML document:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Virtual Library</title>
  </head>
  <body>
    <p>Moved to <a href="http://vlib.org/">vlib.org</a>.</p>
  </body>
</html>  

8.9.2 Writing a directory listing as an XML document

This recursive example builds an XML document with the information of a directory listing. The example has two functions. The first initializes the writer. The second walks through the filesystem and outputs information about the filesystem as XML. The complete dirlist.py program can be found on Uche Ogbuji's blog.

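import os
from Ft.Xml import MarkupWriter

def genXML(dir, out):
    print "Processing %s" % dir
    writer = MarkupWriter(out, indent=u"yes")
    writer.startDocument()
    # Start the recursion at depth 0
    recurse_dir(dir, writer, 0)
    writer.endDocument()

def recurse_dir(path, writer, d):
    # d tracks the current recursion depth
    d = d + 1
    for cdir, subdirs, files in os.walk(path):
        writer.startElement(u'directory', attributes={u'name': unicode(cdir)})
        for f in files:
            writer.simpleElement(u'file', attributes={u'name': unicode(f)})
        for subdir in subdirs:
            recurse_dir(os.path.join(cdir, subdir), writer, d)
        writer.endElement(u'directory')
        # Only process the top level of os.walk; subdirectories are
        # handled by the recursive calls above
        break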

8.9.3 Building a bot

As a more complex example, the Emeka IRC bot uses MarkupWriter to build an RDF document containing namespaced elements. See this chunk of the code:

DCE_NS = u'http://purl.org/dc/elements/1.1/'
for nada, category in item['categories']:
    if len(category.split(' ')) > 1:
        # Several space-separated subjects: write one element for each
        for subject in category.split(' '):
            writer.startElement(u"dc:subject", DCE_NS)
            writer.text(subject)
            writer.endElement(u"dc:subject", DCE_NS)
    else:
        writer.startElement(u"dc:subject", DCE_NS)
        writer.text(category)
        writer.endElement(u"dc:subject", DCE_NS)

9 Validation using RELAX NG

4Suite has RELAX NG support based on a bundling of Eric van der Vlist's XVIF implementation.

First of all, you can use the 4xml command line for RELAX NG validation with the --rng flag. For instance, take the following RELAX NG schema (rng-tut3.rng):

<element name="addressBook" xmlns="http://relaxng.org/ns/structure/1.0">
  <zeroOrMore>
    <element name="card">
      <element name="name">
        <text/>
      </element>
      <element name="email">
        <text/>
      </element>
    </element>
  </zeroOrMore>
</element>

The following document (rng-tut1.xml) is valid against the schema:

<addressBook>
  <card>
    <name>John Smith</name>
    <email>js@example.com</email>
  </card>
  <card>
    <name>Fred Bloggs</name>
    <email>fb@example.net</email>
  </card>
</addressBook>

You can check this as follows:

$ 4xml --rng=rng-tut3.rng rng-tut1.xml
<?xml version="1.0" encoding="utf-8"?>
<addressBook>
  <card>
    <name>John Smith</name>
    <email>js@example.com</email>
  </card>
  <card>
    <name>Fred Bloggs</name>
    <email>fb@example.net</email>
  </card>
</addressBook>

Since it passes the schema, 4xml continues normal operation, re-serializing the XML back to stdout.

The following document (rng-tut7.xml) is not valid against the schema:

<addressBook>
  <card>
    <givenName>John</givenName>
    <familyName>Smith</familyName>
    <email>js@example.com</email>
  </card>
  <card>
    <name>Fred Bloggs</name>
    <email>fb@example.net</email>
  </card>
</addressBook>

Which you can check as follows:

$ 4xml --rng=rng-tut3.rng rng-tut7.xml
Traceback (most recent call last):
  File "/home/uogbuji/lib/python2.2/site-packages/Ft/Share/Bin/4xml", line 5, in ?
    XmlCommandLineApp().run()
  File "/home/uogbuji/lib/python2.2/site-packages/Ft/Lib/CommandLine/CommandLineApp.py", line 90, in run
    cmd.run_command(self.authenticationFunction)
  File "/home/uogbuji/lib/python2.2/site-packages/Ft/Lib/CommandLine/Command.py", line 83, in run_command
    self.function(self.clOptions, self.clArguments)
  File "/home/uogbuji/lib/python2.2/site-packages/Ft/Xml/_4xml.py", line 89, in Run
    raise RngInvalid(result)
Ft.Xml.Xvif.RngInvalid: _Pattern Empty, no content expected, 
node <cElement at 0x838d7f4: name u'card', 0 attributes, 7 children>

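The exception identifies the pattern that failed to match and the node at which validation stopped, in this case the card element whose content the schema does not allow.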
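
To make clear what the schema is demanding, the same constraint can be written out by hand. The Python 3 sketch below is not RELAX NG validation and does not use Xvif; matches_address_book is a hypothetical function that checks only element names and order (it ignores text, attributes and namespaces), which is enough to separate the two sample documents.

```python
import xml.etree.ElementTree as ET


def matches_address_book(xml_text):
    """Hand-written check mirroring the addressBook schema (not RELAX NG)."""
    root = ET.fromstring(xml_text)
    if root.tag != "addressBook":
        return False
    for card in root:
        # Each child of addressBook must be a card element
        if card.tag != "card":
            return False
        # Each card must contain exactly a name followed by an email
        if [child.tag for child in card] != ["name", "email"]:
            return False
    return True


valid_doc = """<addressBook>
  <card><name>John Smith</name><email>js@example.com</email></card>
</addressBook>"""

invalid_doc = """<addressBook>
  <card>
    <givenName>John</givenName>
    <familyName>Smith</familyName>
    <email>js@example.com</email>
  </card>
</addressBook>"""
```

A schema language lets you state such rules declaratively instead of writing a new checking function for every document type.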

You can also access validation through the Python API using the new Ft.Xml.Xvif.RelaxNgValidator class. For example:

from Ft.Xml.Xvif import RelaxNgValidator
from Ft.Xml import InputSource
from Ft.Lib import Uri
factory = InputSource.DefaultFactory
rng_uri = Uri.OsPathToUri("rng-tut3.rng", attemptAbsolute=1)
src_uri = Uri.OsPathToUri("rng-tut1.xml", attemptAbsolute=1)
rng_isrc = factory.fromUri(rng_uri)
src_isrc = factory.fromUri(src_uri)

validator = RelaxNgValidator(rng_isrc)
result = validator.isValid(src_isrc)
if result:
    print "Valid"
else:
    print "Invalid"

The isValid() method returns a 1 or 0 for validity. To get the actual structure returned by the validator, use the validate() method instead. This structure can easily be turned into an exception object. The following variation prints "Valid" if valid, and raises an exception if not:

from Ft.Xml.Xvif import RelaxNgValidator, RngInvalid
from Ft.Xml import InputSource
from Ft.Lib import Uri
factory = InputSource.DefaultFactory
rng_uri = Uri.OsPathToUri("rng-tut3.rng", attemptAbsolute=1)
src_uri = Uri.OsPathToUri("rng-tut1.xml", attemptAbsolute=1)
rng_isrc = factory.fromUri(rng_uri)
src_isrc = factory.fromUri(src_uri)

validator = RelaxNgValidator(rng_isrc)
result = validator.validate(src_isrc)
if result.nullable():
    print "Valid"
else:
    raise RngInvalid(result)

If you want to use the validation error message without raising an exception:

# Set-up as above
result = validator.validate(src_isrc)
if result.nullable():
    print "Valid"
else:
    print result.msg

Xvif does not report the location of validation errors, and stops after the first error. It does not support RELAX NG compact syntax (RNC) or nameClasses (name, anyName, nsName, and except elements in the schema). In addition, its support of XML Schema datatypes is incomplete, but has been extended by 4Suite to accommodate a number of types, including the following (asterisk indicates support is exclusive to 4Suite):

  • xs:string

  • xs:normalizedString

  • xs:token

  • xs:ID *

  • xs:IDREF *

  • xs:integer

  • xs:nonPositiveInteger

  • xs:nonNegativeInteger

  • xs:positiveInteger

  • xs:negativeInteger

  • xs:unsignedLong

  • xs:unsignedInt

  • xs:long

  • xs:int

  • xs:short

  • xs:unsignedShort

  • xs:byte

  • xs:unsignedByte

  • xs:decimal

  • xs:date *

  • xs:boolean *

  • xs:time *

  • xs:dateTime *

  • xs:anyURI *

The numeric types all support the totalDigits, minInclusive, maxInclusive, minExclusive, and maxExclusive facets. xs:decimal also supports the fractionDigits facet.

The xs:string, xs:normalizedString, and xs:token types support the length facet. In 4Suite only, xs:string and xs:normalizedString support minLength, maxLength, and pattern facets.
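
For readers unfamiliar with facets, the Python 3 sketch below shows roughly what the numeric facets constrain. It is a simplified model built on the standard decimal module, not the code Xvif or 4Suite uses: check_decimal and digit_counts are hypothetical helpers, only finite values are handled, and the lexical rules of xs:decimal are not enforced.

```python
from decimal import Decimal


def digit_counts(value):
    """Return (total digits, fraction digits) of a finite Decimal,
    ignoring leading and trailing zeros."""
    sign, digits, exponent = value.normalize().as_tuple()
    fraction = max(0, -exponent)
    integer = max(0, len(digits) + exponent)
    return integer + fraction, fraction


def check_decimal(lexical, total_digits=None, fraction_digits=None,
                  min_inclusive=None, max_inclusive=None):
    """Simplified facet check for an xs:decimal-like value."""
    value = Decimal(lexical)
    total, fraction = digit_counts(value)
    if total_digits is not None and total > total_digits:
        return False
    if fraction_digits is not None and fraction > fraction_digits:
        return False
    if min_inclusive is not None and value < Decimal(min_inclusive):
        return False
    if max_inclusive is not None and value > Decimal(max_inclusive):
        return False
    return True
```

For example, check_decimal("12.345", fraction_digits=2) rejects the value because it has three fraction digits, and check_decimal("150", min_inclusive="0", max_inclusive="100") rejects it because it lies outside the inclusive range.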

10 XUpdate processing

XUpdate is a community specification for using an XML vocabulary to express modifications to XML documents. It is essentially an XPath-based XML transformation language, like XSLT. An XUpdate document is an XML document that specifies what changes should be made to another XML document. XUpdate is supported by many XML processing tools, especially in the open source category, but it is neither a W3C Recommendation nor an ISO or IETF standard. It is a project of the XML:DB Initiative's XUpdate Working Group, and it never advanced beyond a Working Draft published in September 2000. It is not very well specified, but it is convenient and enables a basic level of functionality, so it has enjoyed popularity in a number of implementations.

4Suite's XUpdate implementation, 4XUpdate, consists of a Python API (via the Ft.Xml.XUpdate module) and a command-line script (4xupdate). The APIs involve taking a source document (the XML to be updated) and an XUpdate document (the changes to apply), and either producing a new document or updating the source document in-place. The command line tool can be used, for example, as a patching utility for XML. All of XUpdate (such as it's specified) is currently implemented.

The Python API can be invoked directly on Domlette objects or on InputSources. Here is an example of using the ApplyXUpdate convenience function, which takes InputSources:

from Ft.Xml.Domlette import PrettyPrint
from Ft.Xml.InputSource import DefaultFactory
try:
    from Ft.Xml.XUpdate import ApplyXUpdate
except ImportError:
    # the function name changed between 1.0a3 and 1.0b1
    from Ft.Xml.XUpdate import ApplyXupdate as ApplyXUpdate

SOURCE='''<?xml version = "1.0"?>
<ADDRBOOK xmlns="http://bogus/">
  <ENTRY ID="fr">
    <NAME>fred</NAME>
  </ENTRY>
</ADDRBOOK>'''

XU='''<?xml version="1.0"?>
<xu:modifications version="1.0" xmlns:xu="http://www.xmldb.org/xupdate"
  xmlns:myns="http://bogus/">
  <xu:append select="/myns:ADDRBOOK" child="last()">
    <ENTRY ID="vz">
      <NAME>Vasia Zhugenev</NAME>
    </ENTRY>
  </xu:append>
</xu:modifications>'''

src_isrc = DefaultFactory.fromString(SOURCE, "http://test1/")
xup_isrc = DefaultFactory.fromString(XU, "http://test2/")

result_dom = ApplyXUpdate(src_isrc, xup_isrc)
PrettyPrint(result_dom)

#expected:
#<?xml version="1.0" encoding="UTF-8"?>
#<ADDRBOOK xmlns="http://bogus/">
#  <ENTRY ID="fr">
#    <NAME>fred</NAME>
#  </ENTRY>
#<ENTRY ID="vz">
#    <NAME>Vasia Zhugenev</NAME>
#  </ENTRY>
#</ADDRBOOK>

If you have both the source document and XUpdate document as Domlette nodes already, you can use the XUpdate processor directly:

# add to the above script...
from Ft.Xml.Domlette import NonvalidatingReader
from Ft.Xml.XUpdate import Processor
src_isrc = DefaultFactory.fromString(SOURCE, "http://test1/")
xup_isrc = DefaultFactory.fromString(XU, "http://test2/")
src_dom = NonvalidatingReader.parse(src_isrc)
xup_dom = NonvalidatingReader.parse(xup_isrc)
proc = Processor()
proc.execute(src_dom, xup_dom)

# src_dom has been modified in-place
PrettyPrint(src_dom)

Using the processor directly allows you to set XPath variables, if needed:

from Ft.Xml import EMPTY_NAMESPACE

# execute with $x='foo'
proc.execute(src_dom, xup_dom, {(EMPTY_NAMESPACE, u'x'): u'foo'})

The command-line script works on local files or even URIs, if resolvable, and normally sends the result XML to standard output, although it can also be made to write to a file. See "4xupdate -h" for usage instructions.

10.1 XUpdate and namespaces

In order to show how to use XUpdate to make namespace-aware modifications, the following tasks will be demonstrated:

  1. Add a new element in the products namespace, but using no prefix.

  2. Add a new element with a prefix and in the products namespace.

  3. Add a new element that is not in any namespace.

  4. Add a new global attribute in the XHTML namespace.

  5. Add a new global attribute in the special XML namespace.

  6. Add a new attribute in no namespace.

  7. Remove only the code element in the XHTML namespace.

  8. Remove a global attribute.

  9. Remove an attribute that is not in any namespace.

Modification in place can always be simulated with an addition and then a removal. The following code shows how these tasks can be performed in XUpdate.

<xup:modifications version="1.0"
  xmlns:xup="http://www.xmldb.org/xupdate"
  xmlns:p="http://example.com/product-info"
  xmlns:html="http://www.w3.org/1999/xhtml"
  xmlns:xl="http://www.w3.org/1999/xlink"
>

  <!-- Task 1 -->
  <xup:append select="/products/p:product[1]">
    <xup:element
      name="launch-date"
      namespace="http://example.com/product-info"/>
  </xup:append>

  <!-- Task 2 -->
  <xup:append select="/products/p:product[1]">
    <xup:element
      name="p:launch-date"
      namespace="http://example.com/product-info"/>
  </xup:append>

  <!-- Can also be accomplished using literal result elements:
  <xup:append select="/products/p:product[1]">
    <p:launch-date/>
  </xup:append>
  -->

  <!-- Task 3 -->
  <xup:append select="/products/p:product[1]">
    <xup:element name="island"/>
  </xup:append>

  <!-- Can also be accomplished using literal result elements:
  <xup:append select="/products/p:product[1]">
    <island/>
  </xup:append>
  -->

  <!-- Task 4 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="global"
      namespace="http://www.w3.org/1999/xhtml">spam</xup:attribute>
  </xup:append>

  <!-- Task 5 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="xml:lang">en</xup:attribute>
  </xup:append>

  <!-- Task 6 -->
  <xup:append select="/products/p:product/p:description/html:div">
    <xup:attribute name="class">eggs</xup:attribute>
  </xup:append>

  <!-- Task 7 -->
  <xup:remove select="//html:code"/>

  <!-- Task 8 -->
  <xup:remove select="/products/p:product/p:description/html:div/ref/@xl:href"/>

  <!-- Task 9 -->
  <xup:remove select="/products/p:product[1]/@id"/>

</xup:modifications>

If you're familiar with XSLT, then you'll see the resemblance of XUpdate at first glance. The envelope element for modifications expressed in XUpdate is xup:modifications, similar to xsl:transform or xsl:stylesheet. The namespace declarations on this element assign prefixes for use in the XUpdate script and have no connection to the prefixes used in the document being modified (the source document), even though they happen to be the same. If you want to access elements in a namespace declared as the default in the source document, then just as in XSLT you must declare and use a prefix for the namespace in the XUpdate script.
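
The point about default namespaces can be reproduced with any namespace-aware query API. The following Python 3 sketch is not XUpdate or 4Suite code; it uses the standard library's xml.etree.ElementTree, and the prefix myns is an arbitrary choice made by the querying code, unrelated to anything declared in the source document.

```python
import xml.etree.ElementTree as ET

SOURCE = """<ADDRBOOK xmlns="http://bogus/">
  <ENTRY ID="fr"><NAME>fred</NAME></ENTRY>
</ADDRBOOK>"""

root = ET.fromstring(SOURCE)

# An unprefixed name in the query means "no namespace", so nothing matches
unprefixed = root.findall("ENTRY")

# Binding a prefix of our own choosing to the namespace URI does match
prefixed = root.findall("myns:ENTRY", {"myns": "http://bogus/"})
```

The source document never uses a prefix for its elements, yet the query has to, because an unprefixed name in a path expression never picks up a default namespace.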

Each modification request is expressed as an XUpdate instruction. This example demonstrates xup:append and xup:remove. There are other instructions providing other types of modification, such as xup:insert-before and xup:update, and there are also control constructs such as xup:if, which is similar to xsl:if. Instructions usually have a select attribute containing an XPath expression that specifies the node to be used as a reference for modification. In the case of xup:append, select specifies the node to which new XML will be appended as child content. In the case of xup:remove, select identifies nodes to be removed.

When an instruction needs to specify a chunk of XML to be used in the modification, it is expressed as the content of the instruction, in a similar fashion to XSLT templates. In the case of xup:append, this template expresses the chunk of XML to be inserted into the document. In order to generate elements and attributes, XUpdate provides output instructions such as xup:element and xup:attribute, which are very similar to their XSLT equivalents. In another idea borrowed from XSLT, XUpdate allows you to create elements by placing literal result elements in the templates.

If you'd like to get a closer look at XUpdate, the best way is by browsing the very clear examples in the XUpdate Use Cases compiled by Kimbro Staken. The following listing is Python code that can be used to apply an XUpdate script. It is a simplified version of the code for the 4xupdate command line.

import sys
from Ft.Xml import XUpdate
from Ft.Xml import Domlette, InputSource
from Ft.Lib import Uri

# Set up reader objects for parsing the XML files
reader = Domlette.NonvalidatingReader
xureader = XUpdate.Reader()

# Parse the source file
source_uri = Uri.OsPathToUri(sys.argv[1], attemptAbsolute=1)
source = reader.parseUri(source_uri)

# Parse the XUpdate file
xupdate_uri = Uri.OsPathToUri(sys.argv[2], attemptAbsolute=1)
isrc = InputSource.DefaultFactory.fromUri(xupdate_uri)
xupdate = xureader.fromSrc(isrc)

# Set up the XUpdate processor and run against the source file
# The Domlette for the source is modified in place
processor = XUpdate.Processor()
processor.execute(source, xupdate)

# Print the updated DOM node to standard output
Domlette.Print(source)

Notice the use of Uri.OsPathToUri to convert file system paths to proper URIs for use in 4Suite. I strongly recommend this convention as one way to help minimize confusion between file specifications and URIs, which is the basis of many frequently asked questions. The XUpdate.Processor class defines the environment for running XUpdate commands, and execute() is the method that actually kicks off the processing. It operates on a Domlette instance, modifying it in place (so be careful when using XUpdate in this way). The updated document object is printed to standard output using Domlette.Print.

The following snippet illustrates how to run the test script, and the output result.

$ python listing4.py products.xml listing3.xup
<?xml version="1.0" encoding="UTF-8"?>
<products xmlns:p="http://example.com/product-info"
xmlns:html="http://www.w3.org/1999/xhtml"
xmlns:xl="http://www.w3.org/1999/xlink"
>
  <product xmlns="http://example.com/product-info">
    <name xml:lang="en">Python Perfect IDE</name>
    <description>
      Uses mind-reading technology to anticipate and accommodate
      all user needs in Python development.  Implements all
       features though
      the year 3000.  Works well with <code>1166</code>.
    </description>
  <launch-date/><p:launch-date/><island/></product>
  <p:product id="1166">
    <p:name>XSLT Perfect IDE</p:name>
    <p:description>
      <p:code>red</p:code>
      <html:code>blue</html:code>
      <html:div global="spam" class="eggs" xml:lang="en">
        <ref xl:type="simple">A link</ref>
      </html:div>
    </p:description>
  </p:product>
</products>

11 XInclude processing

11.1 About XInclude

XML Inclusions (XInclude) is a W3C Recommendation that provides XML document authors with a robust way of supporting document modularity via the use of transclusions (inclusions by reference). Such modularity would otherwise require using references to external entities declared in a DTD, a system which has various limitations inherited from SGML.

Unlike XML's built-in entity-reference system, the processing of XIncludes is, fundamentally, an XML Infoset transformation, not strictly an operation performed on the serialized (textual) form of a document. Therefore, there is no requirement for when and where XInclude processing should occur; it could happen at parse time if the parser supports it, or could occur on an already-parsed document.

XInclude references consist of two special elements that are placed in the XML document into which external content is to be included: <include> and <fallback>, both in the namespace http://www.w3.org/2001/XInclude. When processed, these elements are replaced with the content they reference, which can be XML or any other text.
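To make the replacement step concrete, here is a rough sketch of include-or-fallback expansion over an already-parsed tree. It uses ElementTree from the Python 3 standard library rather than 4Suite, and it is deliberately simplified: it only handles XML inclusions and does not preserve whitespace around replaced elements. The names RESOURCES, ARTICLE and expand_includes, the ex: hrefs, and the note elements in the fallbacks are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"
XI_FALLBACK = "{http://www.w3.org/2001/XInclude}fallback"

# Hypothetical in-memory resources, keyed by the href that refers to them.
# "ex:section2" is intentionally absent so that its fallback is used.
RESOURCES = {
    "ex:section1": "<section><title>Section 1</title></section>",
}

ARTICLE = """<article xmlns:xi="http://www.w3.org/2001/XInclude">
  <title>My important article</title>
  <xi:include href="ex:section1">
    <xi:fallback><note>Section 1 failed to load</note></xi:fallback>
  </xi:include>
  <xi:include href="ex:section2">
    <xi:fallback><note>Section 2 failed to load</note></xi:fallback>
  </xi:include>
</article>"""


def expand_includes(parent):
    """Replace xi:include children of parent; recurse into other children."""
    # Walk backwards so that insertions do not disturb indexes still to visit.
    for index, child in reversed(list(enumerate(parent))):
        if child.tag != XI_INCLUDE:
            expand_includes(child)
            continue
        text = RESOURCES.get(child.get("href"))
        if text is not None:
            # The resource resolved: the include becomes the parsed resource.
            replacements = [ET.fromstring(text)]
        else:
            # Resolution failed: the include becomes the fallback's children.
            fallback = child.find(XI_FALLBACK)
            replacements = list(fallback) if fallback is not None else []
        parent.remove(child)
        for offset, node in enumerate(replacements):
            parent.insert(index + offset, node)


root = ET.fromstring(ARTICLE)
expand_includes(root)
print([child.tag for child in root])  # ['title', 'section', 'note']
```

Because the function works on a parsed tree rather than on text, it illustrates the point made above that XInclude is a transformation of the document's information set, independent of when parsing happened.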

11.2 XInclude support in 4Suite

4Suite supports XInclude processing only at parse time, as an optional feature of the Domlette readers. It is turned on by default, so if you want to suppress it, you must use the full parsing API — not the Ft.Xml.Parse and Ft.Xml.CreateInputSource convenience functions — and set the parameter processIncludes to False either when creating an InputSource or when calling the parseString, parseUri, or parseStream method of the Domlette reader.

11.3 Examples

The following example includes one section stub into a larger article but has to use the fallback for the second section stub, where resolution fails. “Document using XInclude” lists the contents of the file article.xml, which references two sections using XInclude and provides a fallback for each in case they fail to load. “Section to be included” lists the contents of section1.xml, but this example purposefully does not provide a section2.xml in order to illustrate the fallback behaviour. “Loading the document” lists the Python code used to parse and print this document; note that XInclude processing is done automatically by default.

<article>
  <title>My important article</title>
  <xi:include href="section1.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:fallback><!-- Section 1 failed to load! --></xi:fallback>
  </xi:include>
  <xi:include href="section2.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
    <xi:fallback><!-- Section 2 failed to load! --></xi:fallback>
  </xi:include>
</article>

Figure 1 — Document using XInclude

<section>
  <title>Section 1</title>
  <!-- Write me! -->
</section>

Figure 2 — Section to be included

from Ft.Xml import Parse
from Ft.Xml.Domlette import PrettyPrint
doc = Parse("article.xml")
PrettyPrint(doc)

Figure 3 — Loading the document

“Self-contained example” is very similar to the above example, only this version is self-contained; the resources are stored in Python strings and resolved using a custom resolver.

article = """<article><title>My important article</title>
<xi:include href="ex:section" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback><!-- Section 1 failed to load! --></xi:fallback>
</xi:include>
<xi:include href="ex:section2" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:fallback><!-- Section 2 failed to load! --></xi:fallback>
</xi:include>
</article>"""

section = "<section><title>Section 1</title><!-- Write me! --></section>"

from Ft.Lib.Uri import FtUriResolver, Absolutize
from Ft.Lib import UriException
from cStringIO import StringIO
class MyResolver (FtUriResolver):
  def normalize(self, uriRef, baseUri):
    return Absolutize(uriRef, baseUri)
  def resolve(self, uri):
    if uri == "ex:article":
      return StringIO(article)
    elif uri == "ex:section":
      return StringIO(section)
    else:
      raise UriException(UriException.RESOURCE_ERROR,
                         loc=uri, msg="not found, sorry")

myResolver = MyResolver()

from Ft.Xml.InputSource import InputSourceFactory
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
factory = InputSourceFactory(resolver=myResolver)
isrc = factory.fromUri("ex:article")
doc = NonvalidatingReader.parse(isrc)
PrettyPrint(doc)

Figure 4 — Self-contained example

To turn off XInclude behavior in “Self-contained example”, replace the last three lines with these three lines:

isrc = factory.fromUri("ex:article", processIncludes=False)
doc = NonvalidatingReader.parse(isrc)
PrettyPrint(doc)

“Loading the document” uses the "super simple" parsing API; we need to use the full parsing API in order to disable XInclude expansion (which, paradoxically, takes one fewer line):

from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
doc = NonvalidatingReader.parseStream(file("article.xml"), processIncludes=False)
PrettyPrint(doc)

12 XPointer processing

12.1 About XPointer

XPointer is a set of W3C specifications (one part of which is, as of 2006, still a Working Draft) that provide a means of identifying and referring to a portion of an XML document. The portion being referenced need not be contiguous, and need not constitute a well-formed general entity. XPointers were originally intended to be used in the fragment component of a URI or IRI (the fragment being the part after "#"), but the specifications actually place no restrictions on where they can be used.

An example of an XPointer embedded in a URI would be

http://example.com/inventory.xml#xpointer(//part%5Bstarts-with(sku,%20'999')%5D)

The XPointer in that example is actually

xpointer(//part[starts-with(sku, '999')])

but the URI syntax requires further encoding of some data. The result of evaluating this XPointer would be the same as evaluating the XPath expression //part[starts-with(sku, '999')] against the document identified by the URI http://example.com/inventory.xml.
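The encoding step can be reproduced with the Python 3 standard library. This is only a sketch of the percent-encoding involved, not part of 4Suite; the choice of characters passed as safe is an assumption picked so that the output matches the URI shown above.

```python
from urllib.parse import quote, unquote, urlsplit

uri = ("http://example.com/inventory.xml"
       "#xpointer(//part%5Bstarts-with(sku,%20'999')%5D)")

# urlsplit() leaves the fragment percent-encoded; unquote() decodes it,
# recovering the XPointer as its author would have written it.
fragment = urlsplit(uri).fragment
pointer = unquote(fragment)
print(pointer)

# Going the other way: the characters listed in `safe` are left untouched,
# while the brackets and the space are percent-encoded.
encoded = quote(pointer, safe="()/,'")
print(encoded)
```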

XPointer syntax is simple: a shorthand XPointer is just a name, and refers to the element with that ID (as determined by a DTD or other schema, typically), much like the XPath 1.0 expression id(somename), but with a little more flexibility, since id() is limited to DTD-based data typing.

A scheme-based XPointer consists of a series of one or more parts, separated by optional whitespace, with each part looking like a function call. What appear to be function names are actually syntactic and semantic schemes, of which the most common is the ID-oriented element scheme, and of which the most versatile is the XPath-oriented xpointer scheme.

If a scheme-based XPointer contains more than one part, then the parts are evaluated from left to right, skipping any unsupported/unrecognized schemes, until one is found that identifies something that exists in the document. Some schemes, like the namespace/prefix-binding xmlns, identify nothing (by design), and instead just influence the interpretation of subsequent parts. It's possible for an XPointer to produce different results with different processors, if the author doesn't take care to ensure each part identifies the same thing.
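The part structure described above can be sketched as a small splitter that breaks a scheme-based XPointer into (scheme, data) pairs, which a processor would then try from left to right. This is an illustration written for this manual, not 4Suite's parser; the function name split_pointer_parts is hypothetical. It tracks balanced parentheses and skips characters escaped with a circumflex, but leaves the escapes in the returned data.

```python
def split_pointer_parts(pointer):
    """Split a scheme-based XPointer into a list of (scheme, data) pairs."""
    parts = []
    i, n = 0, len(pointer)
    while i < n:
        if pointer[i].isspace():
            i += 1  # whitespace between parts is optional and ignored
            continue
        open_paren = pointer.find("(", i)
        if open_paren == -1:
            raise ValueError("not a scheme-based pointer: %r" % pointer)
        scheme = pointer[i:open_paren]
        depth = 1
        j = open_paren + 1
        while j < n and depth:
            ch = pointer[j]
            if ch == "^":
                j += 1  # skip the character escaped by the circumflex
            elif ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            j += 1
        if depth:
            raise ValueError("unbalanced parentheses in %r" % pointer)
        # j now points just past the closing parenthesis of this part.
        parts.append((scheme, pointer[open_paren + 1:j - 1]))
        i = j
    return parts


for scheme, data in split_pointer_parts(
        "xmlns(xhtml=http://www.w3.org/1999/xhtml) xpointer(//xhtml:a[@href])"):
    print(scheme, "->", data)
```

Note that the parentheses inside an XPath function call, as in xpointer(//part[starts-with(sku, '999')]), are balanced and therefore stay inside a single part.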

Here are some more examples:

The XPath 1.0 expression id(somename) means the same thing as the XPointer xpointer(id(somename)), and nearly the same thing as the XPointers element(somename) and somename, which just have more flexibility in where the ID can be drawn from.

The XPointer element(somename/3/1) means nearly the same thing as the XPath expression id(somename)/*[3]/*[1].

The XPointer xmlns(xhtml=http://www.w3.org/1999/xhtml)xpointer(//xhtml:a[@href]) could be used to refer to all of the links in an XHTML 1.0 document.
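The correspondence between the element scheme and XPath in the second example can be written down mechanically. The helper below is a hypothetical illustration, not a 4Suite function, and as noted above the two forms are only nearly equivalent, because the element scheme has more freedom in how IDs are determined.

```python
def element_scheme_to_xpath(scheme_data):
    """Translate the data of an element() pointer part into a similar XPath.

    scheme_data is the text between the parentheses, for example
    "somename/3/1", "somename", or "/1/2".
    """
    name, slash, child_sequence = scheme_data.partition("/")
    # Each number in the child sequence selects the Nth child element.
    steps = ""
    if slash:
        steps = "".join("/*[%d]" % int(n) for n in child_sequence.split("/"))
    # A leading name starts from the element with that ID; without one,
    # the child sequence is counted from the document root.
    start = "id('%s')" % name if name else ""
    return start + steps


print(element_scheme_to_xpath("somename/3/1"))  # id('somename')/*[3]/*[1]
print(element_scheme_to_xpath("somename"))      # id('somename')
print(element_scheme_to_xpath("/1/2"))          # /*[1]/*[2]
```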

12.2 XPointer support in 4Suite

4Suite's XPointer implementation, sometimes called 4XPointer, has no command-line interface, but can be used within Python applications. It supports XPointers to different degrees, depending on the circumstances:

  1. When an XML document is being parsed into a Domlette with XInclude processing enabled, any XPointer encountered in an xi:include element is automatically evaluated when the included document is parsed. In this mode the XPointer must use an XPath LocationPath that only uses steps along the child axis. Furthermore, any predicates must be literal numbers, or must be of the specific form [@attname='attvalue']. For example, /foo[3] and /foo[@bar='baz'] will work, but ../foo and foo/[.='bar'] will not. Function calls are not allowed.

  2. If you have not yet parsed an XML document, but have a URI for it, then you can use Ft.Xml.XPointer.SelectUri() to parse the document and evaluate an XPointer embedded in the URI's fragment component. The parsing is performed with Domlette's default NonvalidatingReader instance. There are some implementation gaps to note when using the xpointer scheme: the only additional function fully supported is here(), and the following functions always return empty location-sets: string-range(), range-to(), start-point(), end-point(), and origin(). origin is illegal to use outside of extended XLinks, anyway.

  3. If you have already parsed the document into a Domlette, then you can evaluate an arbitrary XPointer against it by using Ft.Xml.XPointer.SelectNode(). The same implementation gaps as noted in the description of Ft.Xml.XPointer.SelectUri() apply.
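As a rough aid for deciding whether an expression is likely to be usable inside an xi:include (the first case above), the stated restrictions can be approximated with a regular expression. This is only a sketch of the rule as described in this section, not the check 4Suite actually performs, and the names used here are hypothetical.

```python
import re

# Building blocks for the restricted syntax: child-axis steps only, with
# predicates that are either a literal number or [@attname='attvalue'].
NCNAME = r"[A-Za-z_][A-Za-z0-9_.-]*"
QNAME = r"(?:{n}:)?{n}".format(n=NCNAME)
NAME_TEST = r"(?:\*|{q})".format(q=QNAME)
PREDICATE = r"\[(?:[0-9]+|@{q}='[^']*')\]".format(q=QNAME)
STEP = r"{t}(?:{p})*".format(t=NAME_TEST, p=PREDICATE)
RESTRICTED_PATH = re.compile(r"/?{s}(?:/{s})*\Z".format(s=STEP))


def fits_restricted_form(xpath):
    """Return True if xpath matches the approximation of the restricted form."""
    return RESTRICTED_PATH.match(xpath) is not None


for candidate in ["/foo[3]", "/foo[@bar='baz']", "../foo",
                  "article/section[@condition='unfinished']",
                  "//section[@condition='unfinished'][2]"]:
    print(candidate, fits_restricted_form(candidate))
```

The last candidate uses the descendant shorthand //, so it falls outside the restricted form; expressions like it are candidates for SelectUri() or SelectNode() instead.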

Ranges are not supported because Domlette does not support DOM Level 2 Ranges. Uche Ogbuji posted some thoughts about this topic a while back. Also note that although the element scheme is streamable, it is not yet supported in XIncludes due to ID-related limitations in Domlette. Since element and shorthand pointer support are requirements for full XInclude conformance, they will probably be implemented in the future.

In 4Suite 1.0b1 and earlier, the implementation was based on older versions of the specs, and several additional restrictions were in effect: the element scheme was not even an option, XPointers in XIncludes had to be given via URIs (not attributes) and couldn't contain NameTests involving "*", and all other XPointers were only allowed to identify a single node.

12.3 Examples

The following example uses XInclude with XPointer references to include various sections from one document into another document. “article.xml: Document using XInclude with XPointer references” lists the contents of the file article.xml, which references one section using a shorthand pointer and then references any sections that have their condition attribute set to unfinished. “article2.xml: Document with content referenced from article.xml” lists the contents of the file article2.xml, which is referenced from article.xml. “Loading the document” lists the Python code used to parse and print this document; note that XPointer processing is driven from XInclude processing, which is done automatically by default.

<article>
  <title>My important article</title>
  <xi:include href="article2.xml"
              xpointer="woo"
              xmlns:xi="http://www.w3.org/2001/XInclude"/>
  <xi:include href="article2.xml"
              xpointer="xpointer(article/section[@condition='unfinished'])"
              xmlns:xi="http://www.w3.org/2001/XInclude"/>
</article>

Figure 5 — article.xml: Document using XInclude with XPointer references

<article>
  <section condition="unfinished">
    <title>Section 1</title>
    <!-- Write me! -->
  </section>
  <section xml:id="woo">
    <title>Section 2</title>
    <para>Yeah, content.</para>
  </section>
  <section condition="unfinished">
    <title>Section 3</title>
    <!-- Write me, too! -->
  </section>
</article>

Figure 6 — article2.xml: Document with content referenced from article.xml

from Ft.Xml import Parse
from Ft.Xml.Domlette import PrettyPrint
doc = Parse("article.xml")
PrettyPrint(doc)

Figure 7 — Loading the document

As mentioned earlier, XPointer is most commonly used along with XInclude, but 4Suite provides an API for using XPointer directly from Python. Using article2.xml as listed above in “article2.xml: Document with content referenced from article.xml”, “Using XPointer directly from Python” loads two of the nodes loaded previously with XInclude. Note that when using the standalone interface, the code is able to take advantage of more of the XPointer syntax.

from Ft.Xml import Parse
from Ft.Xml.Domlette import PrettyPrint
from Ft.Xml.XPointer import SelectNode

article2 = Parse("article2.xml")
# Shorthand XPointer syntax
node = SelectNode(article2, "woo")[0]
PrettyPrint(node)
# Scheme-based XPointer syntax
node = SelectNode(article2,
                  "xpointer(//section[@condition='unfinished'][2])")[0]
PrettyPrint(node)

Figure 8 — Using XPointer directly from Python
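For readers who want to check which nodes those two pointers identify without installing 4Suite, the sketch below finds the same elements using ElementTree from the Python 3 standard library. It is not an XPointer implementation: the ID lookup and the list indexing are hand-written stand-ins, and the indexing matches the XPath predicate [2] only because both unfinished sections share the same parent. The comments from article2.xml are omitted from the inline copy.

```python
import xml.etree.ElementTree as ET

# ElementTree spells the xml:id attribute with the XML namespace URI.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

ARTICLE2 = """<article>
  <section condition="unfinished"><title>Section 1</title></section>
  <section xml:id="woo"><title>Section 2</title><para>Yeah, content.</para></section>
  <section condition="unfinished"><title>Section 3</title></section>
</article>"""

root = ET.fromstring(ARTICLE2)

# Stand-in for the shorthand pointer "woo": the element whose xml:id is "woo".
by_id = [elem for elem in root.iter() if elem.get(XML_ID) == "woo"][0]
print(by_id.findtext("title"))  # Section 2

# Stand-in for xpointer(//section[@condition='unfinished'][2]): the second
# unfinished section among the children of article.
unfinished = root.findall(".//section[@condition='unfinished']")
print(unfinished[1].findtext("title"))  # Section 3
```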

“Self-contained example” is very similar to the examples above, only this version is self-contained; the resources are stored in Python strings and resolved using a custom resolver.

article = """<article><title>My important article</title>
<xi:include href="ex:article2"
            xpointer="woo"
            xmlns:xi="http://www.w3.org/2001/XInclude"/>
<xi:include href="ex:article2"
            xpointer="xpointer(article/section[@condition='unfinished'])"
            xmlns:xi="http://www.w3.org/2001/XInclude"/>
</article>"""

article2 = """<article>
<section condition="unfinished"><title>Section 1</title><!-- Write me! --></section>
<section xml:id="woo"><title>Section 2</title><para>Yeah, content.</para></section>
<section condition="unfinished"><title>Section 3</title><!-- Write me, too! --></section>
</article>"""

from Ft.Lib.Uri import FtUriResolver, Absolutize
from Ft.Lib import UriException
from cStringIO import StringIO
class MyResolver (FtUriResolver):
  def normalize(self, uriRef, baseUri):
    return Absolutize(uriRef, baseUri)
  def resolve(self, uri):
    if uri == "ex:article":
      return StringIO(article)
    elif uri == "ex:article2":
      return StringIO(article2)
    else:
      raise UriException(UriException.RESOURCE_ERROR,
                         loc=uri, msg="not found, sorry")

myResolver = MyResolver()

from Ft.Xml.InputSource import InputSourceFactory
from Ft.Xml.Domlette import NonvalidatingReader, PrettyPrint
factory = InputSourceFactory(resolver=myResolver)
isrc = factory.fromUri("ex:article")
doc = NonvalidatingReader.parse(isrc)
PrettyPrint(doc)

from Ft.Xml.XPointer import SelectNode

isrc = factory.fromUri("ex:article2")
article2 = NonvalidatingReader.parse(isrc)
node = SelectNode(article2, "woo")[0]
PrettyPrint(node)
node = SelectNode(article2,
                  "xpointer(//section[@condition='unfinished'][2])")[0]
PrettyPrint(node)

Figure 9 — Self-contained example

13 Comprehensive examples

This section contains a set of examples that transcend the boundaries of individual topics. These examples combine multiple different techniques and often address more common use-cases found "in the wild".

13.1 Transforming DocBook using the DocBook XSL stylesheets

In the XML universe, one common use-case is converting DocBook (a common XML application) to various output formats for publishing using the DocBook XSL set of XSLT scripts. If you have the DocBook XSL distribution installed (or if you have an Internet connection), you can transform DocBook files completely within the 4Suite XML API. The following example illustrates how this can be done, and in the process this example touches on a wide variety of 4Suite XML techniques. These are listed below.

  • Building a Domlette XML model manually

  • Parsing XML into a Domlette XML model

  • Using XSLT in 4Suite XML

  • Using InputSources with automatic XML Catalog resolution

  • Managing URIs

  • Writing XML from a Domlette XML model

  • And a bonus feature unrelated to 4Suite: i18n with the DocBook XSL scripts!

from Ft.Xml.Domlette import implementation, PrettyPrint, NonvalidatingReader
from Ft.Xml.Xslt import Processor
from Ft.Xml import Catalog, InputSource, EMPTY_NAMESPACE
from Ft.Lib import Uri, UriException

# New processor
processor = Processor.Processor()

# If you have the DocBook XSL scripts installed in your system, then they are likely
# integrated into the system catalog, which is often at `/etc/xml/catalog` on
# Unix-like systems.  If you have a catalog which resolves the DocBook XSL URIs
# located in a different filename, you can change this filename below.  Otherwise,
# this example will access the DocBook XSL scripts directly (i.e. over the network).
catalog_filename = '/etc/xml/catalog'
# Turn the catalog filename into the corresponding `file` URI.
catalog_URI = Uri.OsPathToUri(catalog_filename)
# Try to load the catalog, moving right along if it doesn't exist.
theCatalog = None
try:
  theCatalog = Catalog.Catalog(catalog_URI)
except UriException, e:
  pass

# Create a new `InputSourceFactory` object to use our catalog.
inputSourceFactory = InputSource.InputSourceFactory(catalog = theCatalog)
docbook_xsl_URI = 'http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl'
# Set up an `InputSource` for the DocBook XSL stylesheets.
docbook_xsl_source = inputSourceFactory.fromUri(docbook_xsl_URI)
# Build a DOM of our stylesheet, then load the stylesheet into the XSLT processor.
transform = NonvalidatingReader.parse(docbook_xsl_source)
processor.appendStylesheetNode(transform, docbook_xsl_URI)

# Now we build our DocBook DOM, with a document root of myDoc.
myDoc = implementation.createRootNode('file:///article.xml')
article = myDoc.createElementNS(EMPTY_NAMESPACE,  'article')
myDoc.appendChild(article)
article.setAttributeNS(None, 'lang', "es")
myDoc.publicId="-//OASIS//DTD DocBook XML V4.2//EN"
myDoc.systemId="http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd"

element = myDoc.createElementNS(EMPTY_NAMESPACE, 'title')
element.appendChild(myDoc.createTextNode('Title of article'))
article.appendChild(element)

section = myDoc.createElementNS(EMPTY_NAMESPACE, 'section')
article.appendChild(section)

element = myDoc.createElementNS(EMPTY_NAMESPACE, 'title')
element.appendChild(myDoc.createTextNode('Title of section'))
section.appendChild(element)

element = myDoc.createElementNS(EMPTY_NAMESPACE, 'para')
element.appendChild(myDoc.createTextNode('paragraph of section'))
section.appendChild(element)

print '************************ xml *******************************'
# Serialize the source document as XML.
PrettyPrint(myDoc)

print '************************ html *******************************'
# Print the result of transforming the document.
result = processor.runNode(myDoc)
print result

14 Resources

Sources of additional information

More on DOMs in Python: Basic DOM processing

External encoding declarations

XML Catalogs (http://uche.ogbuji.net/tech/akara/nodes/2004-06-12/external-encoding)

There is more coverage of the 4Suite XPath package in this Tour of 4Suite.

This slide and the following ones from Alexandre Fayolle's excellent EuroPython 2002 tutorial on Python/XML processing are a great introduction to XPath and XSLT processing in Python.

This XPath and 4XPath tutorial is a bit dated, but very comprehensive. Free registration is required.

You can use EXSLT's node-set extension to provide functionality much like transform chaining. For more details, see "Tip: Multi-pass XSLT".

For more on RELAX NG in general, see The official RELAX NG tutorial.

For more on XVIF, see this follow-up by Eric.

I use 4xml's --rng option in examples in my article on Examplotron.

If you want to try out 4Suite and RELAX NG online, go to http://www.defuze.org/oss/tree/

This article discusses MarkupWriter.

For more examples of MarkupWriter, see:

See this #4suite blog entry for another example of XPath extensions.

Tamito KAJIYAMA responds to a thread discussing the grouped sorting XSLT FAQ in 4XSLT, offering an extension function as a possible solution.