4Suite API Documentation

Module Ft.Lib.Uri

Classes and functions related to URI validation and resolution.

APIs that currently differentiate between Unicode and byte strings are
considered to be experimental; do not count on their uniformity between
releases.

Copyright 2005 Fourthought, Inc. (USA).
Detailed license and copyright information: http://4suite.org/COPYRIGHT
Project home, documentation, distributions: http://4suite.org/

Classes

class FtUriResolver(UriResolverBase)
The URI resolver class used by most of 4Suite, outside of the repository.
Adds support for lenient processing of base URIs.

Methods

normalize(self, uriRef, baseUri)
This function differs from UriResolverBase.normalize() in the following manner:
This function allows for the possibility of the base URI beginning
with a '/', in which case the argument is assumed to be an absolute
path component of a 'file' URI that has no authority component.
Overrides: normalize from class UriResolverBase
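
The lenient treatment of a base that begins with '/' can be pictured roughly as follows. This is only a sketch in modern Python 3 using the standard library: `lenient_normalize` is a hypothetical name, and `urllib.parse.urljoin` stands in for 4Suite's own resolution code, which is not shown here.

```python
from urllib.parse import urljoin


def lenient_normalize(uri_ref, base_uri):
    """Resolve uri_ref against base_uri, tolerating a bare absolute path as base."""
    if base_uri.startswith("/"):
        # Assumption: treat the base as the path of a 'file' URI
        # that has an empty authority component.
        base_uri = "file://" + base_uri
    return urljoin(base_uri, uri_ref)
```

With this sketch, a base of '/docs/a.xml' behaves like 'file:///docs/a.xml', while bases that already carry a scheme pass through unchanged.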

Methods inherited from class UriResolverBase

__init__, generate, resolve

class UriDict(dict)
A dictionary that uses URIs as keys. It attempts to observe some degree of URI equivalence as defined in RFC 3986 section 6. For example, if URIs A and B are equivalent, a dictionary operation involving key B will return the same result as one involving key A, and vice-versa.
This is useful in situations where retrieval of a new representation of a
resource is undesirable for equivalent URIs, such as "file:///x" and
"file://localhost/x" (see RFC 1738), or "http://spam/~x/",
"http://spam/%7Ex/" and "http://spam/%7ex/" (see RFC 3986).

Normalization performed includes case normalization on the scheme and
percent-encoded octets, percent-encoding normalization (decoding of
octets corresponding to unreserved characters), and the reduction of
'file://localhost/' to 'file:///', in accordance with both RFC 1738 and
RFC 3986 (although RFC 3986 encourages using 'localhost' and doing
this for all schemes, not just file).

An instance of this class is used by Ft.Xml.Xslt.XsltContext for caching
documents, so that the XSLT function document() will return identical
nodes, without refetching/reparsing, for equivalent URIs.
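
The key-normalizing behavior described above might be approximated as below. This is a minimal sketch in modern Python 3, not the UriDict implementation: `NormalizingUriDict` and `_normalize_key` are hypothetical names, and only the normalizations listed in this section are attempted.

```python
import re
import string

_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def _normalize_key(uri):
    """Apply the case, percent-encoding and 'localhost' normalizations."""
    def fix_octet(match):
        char = chr(int(match.group(1), 16))
        # Decode octets for unreserved characters; uppercase the hex of the rest.
        return char if char in _UNRESERVED else match.group(0).upper()

    uri = re.sub(r"%([0-9A-Fa-f]{2})", fix_octet, uri)
    scheme, sep, rest = uri.partition(":")
    if sep and re.fullmatch(r"[A-Za-z][A-Za-z0-9+.-]*", scheme):
        uri = scheme.lower() + sep + rest
    if uri.startswith("file://localhost/"):
        uri = "file:///" + uri[len("file://localhost/"):]
    return uri


class NormalizingUriDict(dict):
    """A dict whose lookups treat equivalent URIs as the same key."""

    def __setitem__(self, key, value):
        super().__setitem__(_normalize_key(key), value)

    def __getitem__(self, key):
        return super().__getitem__(_normalize_key(key))

    def __contains__(self, key):
        return super().__contains__(_normalize_key(key))

    def __delitem__(self, key):
        super().__delitem__(_normalize_key(key))
```

A document cached under "http://spam/%7ex/" would then be found again under "http://spam/~x/" without a second fetch.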

Methods

__contains__(self, key)
Overrides: __contains__ from class dict
__delitem__(self, key)
Overrides: __delitem__ from class dict
__getitem__(self, key)
Overrides: __getitem__ from class dict
__iter__(self)
Overrides: __iter__ from class dict
__setitem__(self, key, value)
Overrides: __setitem__ from class dict
has_key(self, key)
Overrides: has_key from class dict
iteritems(self)
Overrides: iteritems from class dict
iterkeys = __iter__(self)
Overrides: iterkeys from class dict

Methods inherited from class dict

__cmp__, __eq__, __ge__, __getattribute__, __gt__, __hash__, __init__, __le__, __len__, __lt__, __ne__, __new__, __repr__, clear, copy, get, items, itervalues, keys, pop, popitem, setdefault, update, values

Methods inherited from class object

__delattr__, __reduce__, __reduce_ex__, __setattr__, __str__

Members

__dict__ = <attribute '__dict__' of 'UriDict' objects>
__weakref__ = <attribute '__weakref__' of 'UriDict' objects>

Members inherited from class dict

fromkeys

Members inherited from class object

__class__

class UriResolverBase
Extendable normalization and resolution functions for URI references.

Methods

__init__(self)
generate(self, hint=None)
This function generates and returns a URI. The hint is an object that helps decide what to generate. The default action is to generate a random UUID URN.
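
For illustration, a random UUID URN of the kind mentioned above can be produced with the Python standard library. This is a sketch only; `generate_uuid_urn` is a hypothetical name and the hint argument is not modeled.

```python
import uuid


def generate_uuid_urn():
    """Return a URN of the form 'urn:uuid:xxxxxxxx-xxxx-...' from a random UUID."""
    return uuid.uuid4().urn
```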
normalize(self, uriRef, baseUri)
Resolves a URI reference to absolute form, effecting the result of RFC 3986 section 5. The URI reference is considered to be relative to the given base URI.
Also verifies that the resulting URI reference has a scheme that
resolve() supports, raising a UriException if it doesn't.

The default implementation does not perform any validation on the base
URI beyond that performed by Absolutize().
resolve(self, uri, baseUri=None)
This function takes a URI or a URI reference plus a base URI, produces a normalized URI using the normalize function if a base URI was given, then attempts to obtain access to an entity representing the resource identified by the resulting URI, returning the entity as a stream (a Python file-like object).
Raises a UriException if the URI scheme is unsupported or if a stream
could not be obtained for any reason.

Functions

Absolutize(uriRef, baseUri)
Resolves a URI reference to absolute form, effecting the result of RFC 3986 section 5. The URI reference is considered to be relative to the given base URI.
It is the caller's responsibility to ensure that the base URI matches
the absolute-URI syntax rule of RFC 3986, and that its path component
does not contain '.' or '..' segments if the scheme is hierarchical.
Unexpected results may occur otherwise.

This function only conducts a minimal sanity check in order to determine
if relative resolution is possible: it raises a UriException if the base
URI does not have a scheme component. While it is true that the base URI
is irrelevant if the URI reference has a scheme, an exception is raised
in order to signal that the given string does not even come close to
meeting the criteria to be usable as a base URI.

It is the caller's responsibility to make a determination of whether the
URI reference constitutes a "same-document reference", as defined in RFC
2396 or RFC 3986. As per the spec, dereferencing a same-document
reference "should not" involve retrieval of a new representation of the
referenced resource. Note that the two specs have different definitions
of same-document reference: RFC 2396 says it is *only* the cases where the
reference is the empty string, or "#" followed by a fragment; RFC 3986
requires making a comparison of the base URI to the absolute form of the
reference (as produced by the spec's resolution algorithm), minus its
fragment component, if any.

This function is similar to urlparse.urljoin() and urllib.basejoin().
Those functions, however, are (as of Python 2.3) outdated, buggy, and/or
designed to produce results acceptable for use with other core Python
libraries, rather than being earnest implementations of the relevant
specs. Their problems are most noticeable in their handling of
same-document references and 'file:' URIs, both being situations that
come up far too often to consider the functions reliable enough for
general use.
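
The scheme check on the base URI and the RFC 3986 notion of a same-document reference can be sketched as follows. This is an illustration in modern Python 3, where `urllib.parse.urljoin` is used as a stand-in for the resolution step; `absolutize_sketch` and `is_same_document_3986` are hypothetical names, and ValueError stands in for UriException.

```python
from urllib.parse import urldefrag, urljoin, urlsplit


def absolutize_sketch(uri_ref, base_uri):
    """Resolve uri_ref against base_uri, refusing a base that has no scheme."""
    if not urlsplit(base_uri).scheme:
        raise ValueError("base URI %r has no scheme component" % base_uri)
    return urljoin(base_uri, uri_ref)


def is_same_document_3986(uri_ref, base_uri):
    """Compare the resolved reference, minus fragment, with the base URI."""
    target = urldefrag(absolutize_sketch(uri_ref, base_uri)).url
    return target == urldefrag(base_uri).url
```

Under the RFC 3986 definition, a reference such as 'b?x=1' against the base 'http://example.org/a/b?x=1' counts as a same-document reference, even though it is neither empty nor a bare fragment as RFC 2396 would require.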
BaseJoin(base, uriRef)
Merges a base URI reference with another URI reference, returning a new URI reference.
It behaves exactly the same as Absolutize(), except the arguments
are reversed, and it accepts any URI reference (even a relative URI)
as the base URI. If the base has no scheme component, it is
evaluated as if it did, and then the scheme component of the result
is removed from the result, unless the uriRef had a scheme. Thus, if
neither argument has a scheme component, the result won't have one.

This function is named BaseJoin because it is very much like
urllib.basejoin(), but it follows the current RFC 3986 algorithms
for path merging, dot segment elimination, and inheritance of query
and fragment components.

WARNING: This function exists for 2 reasons: (1) because of a need
within the 4Suite repository to perform URI reference absolutization
using base URIs that are stored (inappropriately) as absolute paths
in the subjects of statements in the RDF model, and (2) because of
a similar need to interpret relative repo paths in a 4Suite product
setup.xml file as being relative to a path that can be set outside
the document. When these needs go away, this function probably will,
too, so it is not advisable to use it.
GetScheme(uriRef)
Obtains, with optimum efficiency, just the scheme from a URI reference. Returns a string, or if no scheme could be found, returns None.
IsAbsolute(identifier)
Given a string believed to be a URI or URI reference, tests that it is absolute (as per RFC 3986), not relative -- i.e., that it has a scheme.
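
A scheme test of this kind can be sketched with one regular expression based on the RFC 3986 scheme syntax. The names `get_scheme_sketch` and `is_absolute_sketch` are hypothetical, and whether GetScheme() changes the case of its result is not stated here, so the sketch returns the scheme as written.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), followed by ":"
_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")


def get_scheme_sketch(uri_ref):
    """Return the scheme of uri_ref, or None if it has none."""
    match = _SCHEME.match(uri_ref)
    return match.group(1) if match else None


def is_absolute_sketch(identifier):
    """A reference is treated as absolute when it begins with a scheme."""
    return get_scheme_sketch(identifier) is not None
```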
MakeUrllibSafe(uriRef)
Makes the given RFC 3986-conformant URI reference safe for passing to legacy urllib functions. The result may not be a valid URI.
As of Python 2.3.3, urllib.urlopen() does not fully support
internationalized domain names, does not strip fragment components,
and on Windows expects file URIs to use '|' instead of ':' in the
path component corresponding to the drivespec. It also relies on
urllib.unquote(), which mishandles unicode arguments. This function
produces a URI reference that will work around these issues, although
the IDN workaround is limited to Python 2.3 only. May raise a
UnicodeEncodeError if the URI reference is Unicode and erroneously
contains non-ASCII characters.
MatchesUriRefSyntax(s)
This function returns true if the given string could be a URI reference, as defined in RFC 3986, just based on the string's syntax.
A URI reference can be a URI or certain portions of one, including the
empty string, and it can have a fragment component.
MatchesUriSyntax(s)
This function returns true if the given string could be a URI, as defined in RFC 3986, just based on the string's syntax.
A URI is by definition absolute (begins with a scheme) and does not end
with a #fragment. It also must adhere to various other syntax rules.
NormalizeCase(uriRef, doHost=False)
Returns the given URI reference with the case of the scheme, percent-encoded octets, and, optionally, the host, all normalized, implementing section 6.2.2.1 of RFC 3986. The normal form of scheme and host is lowercase, and the normal form of percent-encoded octets is uppercase.
The URI reference can be given as either a string or as a sequence as
would be provided by the SplitUriRef function. The return value will
be a string or tuple.
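
As a rough picture of the case normalization in RFC 3986 section 6.2.2.1, the sketch below handles string input only. `normalize_case_sketch` and `do_host` are hypothetical names, and the sequence form accepted by NormalizeCase() is not modeled.

```python
import re


def normalize_case_sketch(uri_ref, do_host=False):
    """Lowercase the scheme (and optionally the host); uppercase %XX hex digits."""
    result = re.sub(r"^[A-Za-z][A-Za-z0-9+.-]*:",
                    lambda m: m.group(0).lower(), uri_ref)
    if do_host:
        # Group 1: optional scheme, '//' and optional userinfo; group 2: host[:port].
        result = re.sub(r"^((?:[^:/?#]+:)?//(?:[^/?#@]*@)?)([^/?#]*)",
                        lambda m: m.group(1) + m.group(2).lower(), result)
    # Done last so that lowercasing the host cannot undo it.
    return re.sub(r"%[0-9A-Fa-f]{2}", lambda m: m.group(0).upper(), result)
```

Userinfo, path, query and fragment keep their case apart from the hex digits of percent-encoded octets.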
NormalizePathSegments(path)
Given a string representing the path component of a URI reference having a hierarchical scheme, returns the string with dot segments ('.' and '..') removed, implementing section 6.2.2.3 of RFC 3986. If the path is relative, it is returned with no changes.
NormalizePathSegmentsInUri(uri)
Given a string representing a URI or URI reference having a hierarchical scheme, returns the string with dot segments ('.' and '..') removed from the path component, implementing section 6.2.2.3 of RFC 3986. If the path is relative, the URI or URI reference is returned with no changes.
NormalizePercentEncoding(s)
Given a string representing a URI reference or a component thereof, returns the string with all percent-encoded octets that correspond to unreserved characters decoded, implementing section 6.2.2.2 of RFC 3986.
OsPathToUri(path, attemptAbsolute=True, osname=None)
This function converts an OS-specific file system path to a URI of the form 'file:///path/to/the/file'.
In addition, if the path is absolute, any dot segments ('.' or '..') will
be collapsed, so that the resulting URI can be safely used as a base URI
by functions such as Absolutize().

The given path will be interpreted as being one that is appropriate for
use on the local operating system, unless a different osname argument is
given.

If the given path is relative, an attempt may be made to first convert
the path to absolute form by interpreting the path as being relative
to the current working directory.  This is the case if the attemptAbsolute
flag is True (the default).  If attemptAbsolute is False, a relative
path will result in a URI of the form file:relative/path/to/a/file .

attemptAbsolute has no effect if the given path is not for the
local operating system.

On Windows, the drivespec will become the first step in the path component
of the URI. If the given path contains a UNC hostname, this name will be
used for the authority component of the URI.

Warning: Some libraries, such as urllib.urlopen(), may not behave as
expected when given a URI generated by this function. On Windows you may
want to call re.sub('(/[A-Za-z]):', r'\1|', uri) on the URI to prepare it
for use by functions such as urllib.url2pathname() or urllib.urlopen().

This function is similar to urllib.pathname2url(), but is more featureful
and produces better URIs.
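
The conversion of an absolute path can be sketched as below. This is a simplified illustration in modern Python 3, not the OsPathToUri() implementation: `os_path_to_uri_sketch` is a hypothetical name, relative paths and UNC hostnames are rejected rather than handled, and the choice of which characters to leave unencoded is an assumption.

```python
import ntpath
import posixpath
from urllib.parse import quote


def os_path_to_uri_sketch(path, osname="posix"):
    """Convert an absolute OS path to a 'file' URI with an empty authority."""
    if osname == "nt":
        drive, rest = ntpath.splitdrive(path)
        if len(drive) != 2 or not rest.startswith(("\\", "/")):
            raise ValueError("this sketch handles only drive-letter absolute paths")
        # normpath collapses '.' and '..'; the drivespec becomes the first step.
        path = "/" + ntpath.normpath(path).replace("\\", "/")
    else:
        if not posixpath.isabs(path):
            raise ValueError("this sketch handles only absolute paths")
        path = posixpath.normpath(path)
    return "file://" + quote(path, safe="/:")
```

Passing osname explicitly keeps the result independent of the system the code runs on, which mirrors the role of the osname argument described above.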
PathResolve(paths)
This function takes a list of file URIs. The first can be absolute or relative to the URI equivalent of the current working directory. The rest must be relative to the first. The function converts them all to OS paths appropriate for the local system, and then creates a single final path by resolving each path in the list against the following one. This final path is returned as a URI.
PercentDecode(s, encoding='utf-8', decodable=None)
[*** Experimental API ***] Reverses the percent-encoding of the given string.
This function is similar to urllib.unquote(), but can also process a
Unicode string, not just a regular byte string.

By default, all percent-encoded sequences are decoded, but if a byte
string is given via the 'decodable' argument, only the sequences
corresponding to those octets will be decoded.

If the string is Unicode, the percent-encoded sequences are converted to
bytes, then converted back to Unicode according to the encoding given in
the encoding argument. For example, by default, u'abc%E2%80%A2' will be
converted to u'abc\u2022', because byte sequence E2 80 A2 represents
character U+2022 in UTF-8.

If the string is not Unicode, the percent-encoded octets are just
converted to bytes, and the encoding argument is ignored. For example,
'abc%E2%80%A2' will be converted to 'abc\xe2\x80\xa2'.

This function is intended for use on the portions of a URI that are
delimited by reserved characters (see PercentEncode), or on a value from
data of media type application/x-www-form-urlencoded.
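
The character versus byte distinction, and the idea behind the 'decodable' argument, can be illustrated in modern Python 3 as follows. The standard library functions `unquote` and `unquote_to_bytes` are used for the first part; `percent_decode_selected` is a hypothetical helper that assumes the selected octets are ASCII.

```python
import re
from urllib.parse import unquote, unquote_to_bytes

# Text in, text out: the octets E2 80 A2 are decoded as UTF-8.
as_text = unquote("abc%E2%80%A2", encoding="utf-8")

# Bytes out: the octets are kept as they are, with no character decoding.
as_bytes = unquote_to_bytes("abc%E2%80%A2")


def percent_decode_selected(s, decodable):
    """Decode only the %XX sequences whose octet appears in decodable (bytes)."""
    def repl(match):
        octet = int(match.group(1), 16)
        return chr(octet) if octet in decodable else match.group(0)

    return re.sub(r"%([0-9A-Fa-f]{2})", repl, s)
```

For example, decoding only the octet for '~' leaves an encoded '/' untouched, so a percent-encoded slash inside a path segment is not mistaken for a segment separator.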
PercentEncode(s, encoding='utf-8', encodeReserved=True, spaceToPlus=False, nlChars=None, reservedChars="/=&+?#;@,:$!*[]()'")
[*** Experimental API ***] This function applies percent-encoding, as described in RFC 3986 sec. 2.1, to the given string, in order to prepare the string for use in a URI. It replaces characters that are not allowed in a URI. By default, it also replaces characters in the reserved set, which normally includes the generic URI component delimiters ":" "/" "?" "#" "[" "]" "@" and the subcomponent delimiters "!" "$" "&" "'" "(" ")" "*" "+" "," ";" "=".
Ideally, this function should be used on individual components or
subcomponents of a URI prior to assembly of the complete URI, not
afterward, because this function has no way of knowing which characters
in the reserved set are being used for their reserved purpose and which
are part of the data. By default it assumes that they are all being used
as data, thus they all become percent-encoded.

The characters in the reserved set can be overridden from the default by
setting the reservedChars argument. The percent-encoding of characters
in the reserved set can be disabled by unsetting the encodeReserved flag.
Do this if the string is an already-assembled URI or a URI component,
such as a complete path.

If the given string is Unicode, the name of the encoding given in the
encoding argument will be used to determine the percent-encoded octets
for characters that are not in the U+0000 to U+007F range. The codec
identified by the encoding argument must return a byte string.

If the given string is not Unicode, the encoding argument is ignored and
the string is interpreted to represent literal octets, rather than
characters. Octets above \x7F will be percent-encoded as-is, e.g., \xa0
becomes %A0, not, say, %C2%A0.

The spaceToPlus flag controls whether space characters are changed to
"+" characters in the result, rather than being percent-encoded.
Generally, this is not required, and given the status of "+" as a
reserved character, is often undesirable. But it is required in certain
situations, such as when generating application/x-www-form-urlencoded
content or RFC 3151 public identifier URNs, so it is supported here.

The nlChars argument, if given, is a sequence type in which each member
is a substring that indicates a "new line". Occurrences of this substring
will be replaced by '%0D%0A' in the result, as is required when generating
application/x-www-form-urlencoded content.

This function is similar to urllib.quote(), but is more conformant and
Unicode-friendly. Suggestions for improvements welcome.
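
The effect of the main options can be illustrated with the modern Python 3 standard library. The helpers below are hypothetical wrappers around `urllib.parse.quote` and `quote_plus`, not PercentEncode() itself; their `safe` settings are chosen to resemble encodeReserved, and the newline handling resembles what nlChars and spaceToPlus are described as doing.

```python
import re
from urllib.parse import quote, quote_plus


def encode_component(text):
    """Encode one component; reserved characters are treated as data."""
    return quote(text, safe="")


def encode_assembled_path(text):
    """Encode an already assembled path, leaving the '/' delimiters alone."""
    return quote(text, safe="/")


def encode_form_value(text):
    """Form-style encoding: any newline becomes %0D%0A, spaces become '+'."""
    normalized = re.sub(r"\r\n|\r|\n", "\r\n", text)
    return quote_plus(normalized)
```

The text versus bytes distinction carries over as well: `quote("\xa0", safe="")` encodes the character U+00A0 via UTF-8, while `quote(b"\xa0", safe="")` encodes the single octet as given.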
PublicIdToUrn(publicid)
Converts a public identifier to a URN that conforms to RFC 3151.
Relativize(targetUri, againstUri, subPathOnly=False)
This function returns a relative URI that is consistent with `targetUri` when resolved against `againstUri`. If no such relative URI exists, for whatever reason, this function returns `None`.
To be precise, if a string called `rel` exists such that
``Absolutize(rel, againstUri) == targetUri``, then `rel` is returned by
this function.  In these cases, `Relativize` is in a sense the inverse
of `Absolutize`.  In all other cases, `Relativize` returns `None`.

The following idiom may be useful for obtaining compliant relative
reference strings (e.g. for `path`) for use in other methods of this
package::

  path = Relativize(OsPathToUri(path), OsPathToUri('.'))

If `subPathOnly` is `True`, then this function will only return a relative
reference if such a reference exists relative to the last hierarchical
segment of `againstUri`.  In particular, this relative reference will
not start with '/' or '../'.
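
The defining property, that resolving the result against `againstUri` gives back `targetUri`, can be turned into a simple check. The sketch below is an illustration in modern Python 3 and is far less capable than Relativize(): `relativize_sketch` is a hypothetical name, candidates are built with `posixpath.relpath`, `urllib.parse.urljoin` stands in for Absolutize(), and `subPathOnly` is not modeled.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def relativize_sketch(target_uri, against_uri):
    """Return a relative reference that resolves to target_uri, or None."""
    target, against = urlsplit(target_uri), urlsplit(against_uri)
    if (target.scheme, target.netloc) != (against.scheme, against.netloc):
        return None
    if not target.path.startswith("/") or not against.path.startswith("/"):
        return None
    # Relative references are resolved against the directory of the base path.
    base_dir = posixpath.dirname(against.path)
    candidate = posixpath.relpath(target.path, base_dir)
    # Keep the candidate only if it satisfies the defining property.
    if urljoin(against_uri, candidate) == target_uri:
        return candidate
    return None
```

Because of the final check, inputs this simple approach cannot express (for example a target with a query component) yield None rather than a wrong answer.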
RemoveDotSegments(path)
Supports Absolutize() by implementing the remove_dot_segments function described in RFC 3986 sec. 5.2. It collapses most of the '.' and '..' segments out of a path without eliminating empty segments. It is intended to be used during the path merging process and may not give expected results when used independently. Use NormalizePathSegments() or NormalizePathSegmentsInUri() if more general normalization is desired.
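
For reference, the remove_dot_segments routine of RFC 3986 section 5.2.4 can be written as below. This is a direct transcription of the steps in the RFC into modern Python 3, offered as a sketch and not as the RemoveDotSegments() source; `remove_dot_segments_sketch` is a hypothetical name.

```python
def remove_dot_segments_sketch(path):
    """Collapse '.' and '..' segments following RFC 3986 section 5.2.4."""
    output = []
    while path:
        if path.startswith("../"):            # step A
            path = path[3:]
        elif path.startswith("./"):           # step A
            path = path[2:]
        elif path.startswith("/./"):          # step B
            path = path[2:]
        elif path == "/.":                    # step B
            path = "/"
        elif path.startswith("/../"):         # step C
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":                   # step C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):             # step D
            path = ""
        else:                                 # step E: move one segment to output
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

Empty segments survive the process, so '/a//b/../c' becomes '/a//c' and not '/a/c'.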
SplitAuthority(authority)
Given a string representing the authority component of a URI, returns a tuple consisting of the subcomponents (userinfo, host, port). No percent-decoding is performed.
SplitFragment(uri)
Given a URI or URI reference, returns a tuple consisting of (base, fragment), where base is the portion before the '#' that precedes the fragment component.
SplitUriRef(uriref)
Given a valid URI reference as a string, returns a tuple representing the generic URI components, as per RFC 3986 appendix B. The tuple's structure is (scheme, authority, path, query, fragment).
All values will be strings (possibly empty) or None if undefined.

Note that per RFC 3986, there is no distinction between a path and
an "opaque part", as there was in RFC 2396.
StripFragment(uriRef)
Returns the given URI or URI reference with the fragment component, if any, removed.
UnsplitUriRef(uriRefSeq)
Given a sequence as would be produced by SplitUriRef(), assembles and returns a URI reference as a string.
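
The five-part tuple described for SplitUriRef(), and its reassembly by UnsplitUriRef(), can be sketched with the regular expression from RFC 3986 appendix B. The function names below are hypothetical, and the sketch makes no claim about how 4Suite implements the split.

```python
import re

# The regular expression given in RFC 3986 appendix B.
_URI_REF = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_uri_ref_sketch(uri_ref):
    """Return (scheme, authority, path, query, fragment); None means undefined."""
    return _URI_REF.match(uri_ref).group(2, 4, 5, 7, 9)


def unsplit_uri_ref_sketch(parts):
    """Reassemble a URI reference from the five-part sequence."""
    scheme, authority, path, query, fragment = parts
    result = ""
    if scheme is not None:
        result += scheme + ":"
    if authority is not None:
        result += "//" + authority
    result += path
    if query is not None:
        result += "?" + query
    if fragment is not None:
        result += "#" + fragment
    return result
```

Keeping None distinct from the empty string is what lets a reference such as 'http://a/?' survive a split and unsplit round trip with its empty query intact.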
UriToOsPath(uri, attemptAbsolute=True, encoding='utf-8', osname=None)
This function converts a URI reference to an OS-specific file system path.
If the URI reference is given as a Unicode string, then the encoding
argument determines how percent-encoded components are interpreted, and
the result will be a Unicode string. If the URI reference is a regular
byte string, the encoding argument is ignored and the result will be a
byte string in which percent-encoded octets have been converted to the
bytes they represent. For example, the trailing path segment of
u'file:///a/b/%E2%80%A2' will by default be converted to u'\u2022',
because sequence E2 80 A2 represents character U+2022 in UTF-8. If the
string were not Unicode, the trailing segment would become the 3-byte
string '\xe2\x80\xa2'.

The osname argument determines for what operating system the resulting
path is appropriate. It defaults to os.name and is typically the value
'posix' on Unix systems (including Mac OS X and Cygwin), and 'nt' on
Windows NT/2000/XP.

This function is similar to urllib.url2pathname(), but is more featureful
and produces better paths.

If the given URI reference is not relative, its scheme component must be
'file', and an exception will be raised if it isn't.

In accordance with RFC 3986, RFC 1738 and RFC 1630, an authority
component that is the string 'localhost' will be treated the same as an
empty authority.

Dot segments ('.' or '..') in the path component are NOT collapsed.

If the path component of the URI reference is relative and the
attemptAbsolute flag is True (the default), then the resulting path
will be made absolute by considering the path to be relative to the
current working directory. There is no guarantee that such a result
will be an accurate interpretation of the URI reference.

attemptAbsolute has no effect if the result is not being produced for
the local operating system.

Fragment and query components of the URI reference are ignored.

If osname is 'posix', the authority component must be empty or just
'localhost'. An exception will be raised otherwise, because there is no
standard way of interpreting other authorities. Also, if '%2F' is in a
path segment, it will be converted to r'\/' (a backslash-escaped forward
slash). The caller may need to take additional steps to prevent this from
being interpreted as if it were a path segment separator.

If osname is 'nt', a drivespec is recognized as the first occurrence of a
single letter (A-Z, case-insensitive) followed by '|' or ':', occurring as
either the first segment of the path component, or (incorrectly) as the
entire authority component. A UNC hostname is recognized as a non-empty,
non-'localhost' authority component that has not been recognized as a
drivespec, or as the second path segment if the first path segment is
empty. If a UNC hostname is detected, the result will begin with
'\\<hostname>\'. If a drivespec was detected also, the first path segment
will be '$<driveletter>$'. If a drivespec was detected but a UNC hostname
was not, then the result will begin with '<driveletter>:'.

Windows examples:
'file:x/y/z' => r'x\y\z';
'file:/x/y/z' (not recommended) => r'\x\y\z';
'file:///x/y/z' => r'\x\y\z';
'file:///c:/x/y/z' => r'C:\x\y\z';
'file:///c|/x/y/z' => r'C:\x\y\z';
'file:///c:/x:/y/z' => r'C:\x:\y\z' (bad path, valid interpretation);
'file://c:/x/y/z' (not recommended) => r'C:\x\y\z';
'file://host/share/x/y/z' => r'\\host\share\x\y\z';
'file:////host/share/x/y/z' => r'\\host\share\x\y\z';
'file://host/x:/y/z' => r'\\host\x:\y\z' (bad path, valid interp.);
'file://localhost/x/y/z' => r'\x\y\z';
'file://localhost/c:/x/y/z' => r'C:\x\y\z';
'file:///C:%5Cx%5Cy%5Cz' (not recommended) => r'C:\x\y\z'
UrlOpen(url, *args, **kwargs)
A replacement/wrapper for urllib2.urlopen().
Simply calls MakeUrllibSafe() on the given URL and passes the result
and all other args to urllib2.urlopen().
UrnToPublicId(urn)
Converts a URN that conforms to RFC 3151 to a public identifier.
For example, the URN
"urn:publicid:%2B:IDN+example.org:DTD+XML+Bookmarks+1.0:EN:XML"
will be converted to the public identifier
"+//IDN example.org//DTD XML Bookmarks 1.0//EN//XML"

Raises a UriException if the given URN cannot be converted.
Query and fragment components, if present, are ignored.
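
The transcription between public identifiers and URNs follows the character mapping of RFC 3151, which can be sketched in both directions as below. The function names are hypothetical, ValueError stands in for UriException, and the handling of query and fragment components is left out.

```python
import re

_TO_URN = {"//": ":", "::": ";", " ": "+", "+": "%2B", ":": "%3A", "/": "%2F",
           ";": "%3B", "'": "%27", "?": "%3F", "#": "%23", "%": "%25"}
_FROM_URN = {urn: pub for pub, urn in _TO_URN.items()}


def public_id_to_urn_sketch(publicid):
    """Transcribe a public identifier into a 'urn:publicid:' URN."""
    # Leading/trailing whitespace is dropped and inner runs are collapsed.
    normalized = " ".join(publicid.split())
    body = re.sub(r"//|::|[ +:/;'?#%]",
                  lambda m: _TO_URN[m.group(0)], normalized)
    return "urn:publicid:" + body


def urn_to_public_id_sketch(urn):
    """Reverse the transcription; raises ValueError for other URNs."""
    prefix = "urn:publicid:"
    if urn[:len(prefix)].lower() != prefix:
        raise ValueError("not a publicid URN: %r" % urn)
    return re.sub(r"%(?:2B|3A|2F|3B|27|3F|23|25)|[:;+]",
                  lambda m: _FROM_URN[m.group(0).upper()],
                  urn[len(prefix):], flags=re.IGNORECASE)
```

Applied to the example above, the sketch maps the public identifier and the URN onto each other in both directions.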

Globals

BASIC_RESOLVER = <Ft.Lib.Uri.FtUriResolver instance>
DEFAULT_URI_SCHEMES = ('http', 'https', 'file', 'ftp', 'data', 'pep302')
WINDOWS_SLASH_COMPAT = True