APIs that currently differentiate between Unicode and byte strings are considered to be experimental; do not count on their uniformity between releases. Copyright 2005 Fourthought, Inc. (USA). Detailed license and copyright information: http://4suite.org/COPYRIGHT Project home, documentation, distributions: http://4suite.org/
Absolutize, BaseJoin, GetScheme, IsAbsolute, MakeUrllibSafe, MatchesUriRefSyntax, MatchesUriSyntax, NormalizeCase, NormalizePathSegments, NormalizePathSegmentsInUri, NormalizePercentEncoding, OsPathToUri, PathResolve, PercentDecode, PercentEncode, PublicIdToUrn, Relativize, RemoveDotSegments, SplitAuthority, SplitFragment, SplitUriRef, StripFragment, UnsplitUriRef, UriToOsPath, UrlOpen, UrnToPublicId
Adds support for lenient processing of base URIs.
This function allows for the possibility of the base URI beginning with a '/', in which case the argument is assumed to be an absolute path component of 'file' URI that has no authority component.
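The lenient rule above can be sketched as follows. This is only an illustration: `lenient_base` is a hypothetical helper name, not a function of this module, and the sketch does no percent-encoding or validation of the path.

```python
def lenient_base(base):
    """Hypothetical helper: treat a base that starts with '/' as the
    absolute path of a 'file' URI that has no authority component."""
    if base.startswith('/'):
        # Empty authority, so '/x/y' becomes 'file:///x/y'.
        return 'file://' + base
    # Anything else is passed through untouched.
    return base
```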
This is useful in situations where retrieval of a new representation of a resource is undesirable for equivalent URIs, such as "file:///x" and "file://localhost/x" (see RFC 1738), or "http://spam/~x/", "http://spam/%7Ex/" and "http://spam/%7ex/" (see RFC 3986). Normalization performed includes case normalization on the scheme and percent-encoded octets, percent-encoding normalization (decoding of octets corresponding to unreserved characters), and the reduction of 'file://localhost/' to 'file:///', in accordance with both RFC 1738 and RFC 3986 (although RFC 3986 encourages using 'localhost' and doing this for all schemes, not just file). An instance of this class is used by Ft.Xml.Xslt.XsltContext for caching documents, so that the XSLT function document() will return identical nodes, without refetching/reparsing, for equivalent URIs.
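The normalization steps listed above might look roughly like this when used to build a cache key. The function name `normalize_for_cache` is hypothetical, and this is a sketch of the described behaviour rather than the class's actual code.

```python
import re

# RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_UNRESERVED = set(
    'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~')

def normalize_for_cache(uri):
    """Hypothetical helper: reduce equivalent URIs to one spelling."""
    # 1. Case-normalize the scheme.
    m = re.match(r'^([A-Za-z][A-Za-z0-9+.\-]*):', uri)
    if m:
        uri = m.group(1).lower() + uri[m.end(1):]

    # 2. Decode percent-encoded unreserved characters and upper-case
    #    the hex digits of every other percent-encoded octet.
    def fix(match):
        ch = chr(int(match.group(1), 16))
        return ch if ch in _UNRESERVED else '%' + match.group(1).upper()
    uri = re.sub(r'%([0-9A-Fa-f]{2})', fix, uri)

    # 3. Reduce 'file://localhost/' to 'file:///'.
    if uri.startswith('file://localhost/'):
        uri = 'file:///' + uri[len('file://localhost/'):]
    return uri
```

A document cache can then be an ordinary dictionary keyed on the normalized string, so that equivalent spellings map to the same entry.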
Also verifies that the resulting URI reference has a scheme that resolve() supports, raising a UriException if it doesn't. The default implementation does not perform any validation on the base URI beyond that performed by Absolutize().
Raises a UriException if the URI scheme is unsupported or if a stream could not be obtained for any reason.
It is the caller's responsibility to ensure that the base URI matches the absolute-URI syntax rule of RFC 3986, and that its path component does not contain '.' or '..' segments if the scheme is hierarchical. Unexpected results may occur otherwise. This function only conducts a minimal sanity check in order to determine if relative resolution is possible: it raises a UriException if the base URI does not have a scheme component. While it is true that the base URI is irrelevant if the URI reference has a scheme, an exception is raised in order to signal that the given string does not even come close to meeting the criteria to be usable as a base URI. It is the caller's responsibility to make a determination of whether the URI reference constitutes a "same-document reference", as defined in RFC 2396 or RFC 3986. As per the spec, dereferencing a same-document reference "should not" involve retrieval of a new representation of the referenced resource. Note that the two specs have different definitions of same-document reference: RFC 2396 says it is *only* the cases where the reference is the empty string, or "#" followed by a fragment; RFC 3986 requires making a comparison of the base URI to the absolute form of the reference (as produced by the spec's resolution algorithm), minus its fragment component, if any. This function is similar to urlparse.urljoin() and urllib.basejoin(). Those functions, however, are (as of Python 2.3) outdated, buggy, and/or designed to produce results acceptable for use with other core Python libraries, rather than being earnest implementations of the relevant specs. Their problems are most noticeable in their handling of same-document references and 'file:' URIs, both being situations that come up far too often to consider the functions reliable enough for general use.
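To make the scheme check and the two same-document definitions concrete, here is a hedged sketch. It uses Python 3's urllib.parse.urljoin() as a stand-in for the resolution step, which is an assumption and not this module's implementation; the function names are hypothetical, and ValueError stands in for UriException.

```python
import re
from urllib.parse import urljoin, urldefrag

_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*:')

def absolutize_sketch(uri_ref, base_uri):
    """Resolve uri_ref against base_uri; reject a base with no scheme."""
    if not _SCHEME.match(base_uri):
        # The described function raises UriException here.
        raise ValueError('base URI has no scheme component: %r' % base_uri)
    return urljoin(base_uri, uri_ref)

def is_same_document_2396(uri_ref):
    """RFC 2396: only the empty string, or '#' followed by a fragment."""
    return uri_ref == '' or uri_ref.startswith('#')

def is_same_document_3986(uri_ref, base_uri):
    """RFC 3986: compare the resolved reference, minus fragment, to the base."""
    resolved = absolutize_sketch(uri_ref, base_uri)
    return urldefrag(resolved)[0] == urldefrag(base_uri)[0]
```

With the base 'http://a/b/d/e', the reference 'e' is a same-document reference under the RFC 3986 definition but not under RFC 2396, which is why the determination is left to the caller.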
It behaves exactly the same as Absolutize(), except the arguments are reversed, and it accepts any URI reference (even a relative URI) as the base URI. If the base has no scheme component, it is evaluated as if it did, and then the scheme component of the result is removed from the result, unless the uriRef had a scheme. Thus, if neither argument has a scheme component, the result won't have one. This function is named BaseJoin because it is very much like urllib.basejoin(), but it follows the current RFC 3986 algorithms for path merging, dot segment elimination, and inheritance of query and fragment components. WARNING: This function exists for 2 reasons: (1) because of a need within the 4Suite repository to perform URI reference absolutization using base URIs that are stored (inappropriately) as absolute paths in the subjects of statements in the RDF model, and (2) because of a similar need to interpret relative repo paths in a 4Suite product setup.xml file as being relative to a path that can be set outside the document. When these needs go away, this function probably will, too, so it is not advisable to use it.
As of Python 2.3.3, urllib.urlopen() does not fully support internationalized domain names, does not strip fragment components, and, on Windows, expects file URIs to use '|' instead of ':' in the path component corresponding to the drivespec. It also relies on urllib.unquote(), which mishandles Unicode arguments. This function produces a URI reference that works around these issues, although the IDN workaround is limited to Python 2.3 only. It may raise a UnicodeEncodeError if the URI reference is Unicode and erroneously contains non-ASCII characters.
A URI reference can be a URI or certain portions of one, including the empty string, and it can have a fragment component.
A URI is by definition absolute (begins with a scheme) and does not end with a #fragment. It also must adhere to various other syntax rules.
The URI reference can be given as either a string or as a sequence as would be provided by the SplitUriRef function. The return value will be a string or tuple.
In addition, if the path is absolute, any dot segments ('.' or '..') will be collapsed, so that the resulting URI can be safely used as a base URI by functions such as Absolutize(). The given path will be interpreted as being one that is appropriate for use on the local operating system, unless a different osname argument is given. If the given path is relative, an attempt may be made to first convert the path to absolute form by interpreting the path as being relative to the current working directory. This is the case if the attemptAbsolute flag is True (the default). If attemptAbsolute is False, a relative path will result in a URI of the form file:relative/path/to/a/file . attemptAbsolute has no effect if the given path is not for the local operating system. On Windows, the drivespec will become the first step in the path component of the URI. If the given path contains a UNC hostname, this name will be used for the authority component of the URI. Warning: Some libraries, such as urllib.urlopen(), may not behave as expected when given a URI generated by this function. On Windows you may want to call re.sub('(/[A-Za-z]):', r'\1|', uri) on the URI to prepare it for use by functions such as urllib.url2pathname() or urllib.urlopen(). This function is similar to urllib.pathname2url(), but is more featureful and produces better URIs.
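The path-to-URI mapping described above can be approximated as follows. This is a simplified sketch under stated assumptions: `os_path_to_uri_sketch` and `to_urllib_style` are hypothetical names, only 'posix' and 'nt' are handled, and neither dot-segment collapsing nor the attemptAbsolute behaviour is implemented.

```python
import re
from urllib.parse import quote

def os_path_to_uri_sketch(path, osname='posix'):
    """Hypothetical helper: build a 'file' URI from an OS path."""
    if osname == 'nt':
        path = path.replace('\\', '/')
        if path.startswith('//'):
            # UNC path: the hostname becomes the authority component.
            host, _, rest = path[2:].partition('/')
            return 'file://' + host + '/' + quote(rest, safe='/:')
        if re.match(r'^[A-Za-z]:', path):
            # The drivespec becomes the first step of the path component.
            return 'file:///' + quote(path, safe='/:')
    if path.startswith('/'):
        return 'file://' + quote(path)
    # Relative path, no attempt to make it absolute.
    return 'file:' + quote(path)

def to_urllib_style(uri):
    """Apply the substitution quoted in the warning above ('c:' -> 'c|')."""
    return re.sub('(/[A-Za-z]):', r'\1|', uri)
```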
This function is similar to urllib.unquote(), but can also process a Unicode string, not just a regular byte string. By default, all percent-encoded sequences are decoded, but if a byte string is given via the 'decodable' argument, only the sequences corresponding to those octets will be decoded. If the string is Unicode, the percent-encoded sequences are converted to bytes, then converted back to Unicode according to the encoding given in the encoding argument. For example, by default, u'abc%E2%80%A2' will be converted to u'abc\u2022', because byte sequence E2 80 A2 represents character U+2022 in UTF-8. If the string is not Unicode, the percent-encoded octets are just converted to bytes, and the encoding argument is ignored. For example, 'abc%E2%80%A2' will be converted to 'abc\xe2\x80\xa2'. This function is intended for use on the portions of a URI that are delimited by reserved characters (see PercentEncode), or on a value from data of media type application/x-www-form-urlencoded.
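The Unicode versus byte-string distinction above maps onto two standard-library calls in Python 3, which gives a quick way to see both behaviours. This is an analogy using urllib.parse, not this module's API; `percent_decode_only` is a hypothetical helper illustrating the idea behind the 'decodable' argument, restricted here to ASCII octets.

```python
import re
from urllib.parse import unquote, unquote_to_bytes

# Text in, text out: octets are decoded as UTF-8 by default.
as_text = unquote('abc%E2%80%A2')

# Text in, raw octets out: no character decoding is applied.
as_bytes = unquote_to_bytes('abc%E2%80%A2')

def percent_decode_only(s, decodable):
    """Hypothetical helper: decode only the %XX sequences whose octet
    appears in the bytes object `decodable` (ASCII octets only)."""
    def repl(match):
        octet = int(match.group(1), 16)
        if octet < 0x80 and octet in decodable:
            return chr(octet)
        return match.group(0)
    return re.sub(r'%([0-9A-Fa-f]{2})', repl, s)
```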
Ideally, this function should be used on individual components or subcomponents of a URI prior to assembly of the complete URI, not afterward, because this function has no way of knowing which characters in the reserved set are being used for their reserved purpose and which are part of the data. By default it assumes that they are all being used as data, thus they all become percent-encoded. The characters in the reserved set can be overridden from the default by setting the reservedChars argument. The percent-encoding of characters in the reserved set can be disabled by unsetting the encodeReserved flag. Do this if the string is an already-assembled URI or a URI component, such as a complete path. If the given string is Unicode, the name of the encoding given in the encoding argument will be used to determine the percent-encoded octets for characters that are not in the U+0000 to U+007F range. The codec identified by the encoding argument must return a byte string. If the given string is not Unicode, the encoding argument is ignored and the string is interpreted to represent literal octets, rather than characters. Octets above \x7F will be percent-encoded as-is, e.g., \xa0 becomes %A0, not, say, %C2%A0. The spaceToPlus flag controls whether space characters are changed to "+" characters in the result, rather than being percent-encoded. Generally, this is not required, and given the status of "+" as a reserved character, is often undesirable. But it is required in certain situations, such as when generating application/x-www-form-urlencoded content or RFC 3151 public identifier URNs, so it is supported here. The nlChars argument, if given, is a sequence type in which each member is a substring that indicates a "new line". Occurrences of this substring will be replaced by '%0D%0A' in the result, as is required when generating application/x-www-form-urlencoded content. This function is similar to urllib.quote(), but is more conformant and Unicode-friendly. 
Suggestions for improvements welcome.
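For comparison, the corresponding behaviours can be observed with Python 3's urllib.parse.quote() and quote_plus(). These are stand-ins chosen for illustration, not this module's functions, and `encode_form_value` is a hypothetical helper showing the newline handling described for the nlChars argument.

```python
from urllib.parse import quote, quote_plus

# Reserved characters treated as data: everything outside the
# unreserved set is percent-encoded.
component = quote('a b/c', safe='')

# An already-assembled path: leave '/' alone.
whole_path = quote('a b/c', safe='/')

# Text is encoded to octets first (UTF-8 by default) ...
from_text = quote('\xa0', safe='')
# ... whereas bytes are percent-encoded as-is.
from_bytes = quote(b'\xa0', safe='')

def encode_form_value(text):
    """Hypothetical helper for application/x-www-form-urlencoded values:
    newlines become %0D%0A and spaces become '+'."""
    text = text.replace('\r\n', '\n').replace('\n', '\r\n')
    return quote_plus(text)
```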
To be precise, if a string called `rel` exists such that ``Absolutize(rel, againstUri) == targetUri``, then `rel` is returned by this function. In these cases, `Relativize` is in a sense the inverse of `Absolutize`. In all other cases, `Relativize` returns `None`. The following idiom may be useful for obtaining compliant relative reference strings (e.g. for `path`) for use in other methods of this package:: path = Relativize(OsPathToUri(path), OsPathToUri('.')) If `subPathOnly` is `True`, then this method will only return a relative reference if such a reference exists relative to the last hierarchical segment of `againstUri`. In particular, this relative reference will not start with '/' or '../'.
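The defining property above (a result is only returned if resolving it reproduces the target) can be used directly as a check. The following sketch proposes a candidate with posixpath.relpath() and verifies it with urllib.parse.urljoin(); both are stand-ins, `relativize_sketch` is a hypothetical name, and the sketch may return None in cases where a fuller implementation would find a valid reference.

```python
import posixpath
from urllib.parse import urljoin, urlsplit, urlunsplit

def relativize_sketch(target_uri, against_uri, sub_path_only=False):
    """Return a relative reference that resolves to target_uri, or None."""
    target, against = urlsplit(target_uri), urlsplit(against_uri)
    if not against.scheme:
        return None
    if (target.scheme, target.netloc) != (against.scheme, against.netloc):
        return None

    # Candidate: path relative to the directory of the base path.
    base_dir = posixpath.dirname(against.path) or '/'
    rel_path = posixpath.relpath(target.path or '/', base_dir)
    if target.path.endswith('/') and rel_path != '.':
        rel_path += '/'
    candidate = urlunsplit(('', '', rel_path, target.query, target.fragment))

    # Keep the candidate only if it round-trips exactly.
    if urljoin(against_uri, candidate) != target_uri:
        return None
    if sub_path_only and candidate.startswith(('/', '../')):
        return None
    return candidate
```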
All values will be strings (possibly empty) or None if undefined. Note that per RFC 3986, there is no distinction between a path and an "opaque part", as there was in RFC 2396.
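A compact way to see the "string, possibly empty, or None if undefined" convention is the component-splitting regular expression given in RFC 3986, Appendix B. The sketch below applies it; `split_uri_ref_sketch` is a hypothetical name, and whether this module uses the same expression is an assumption.

```python
import re

# Regular expression from RFC 3986, Appendix B.
_URI_REF = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?',
    re.DOTALL)

def split_uri_ref_sketch(uri_ref):
    """Return (scheme, authority, path, query, fragment).

    A component that is absent is None; one that is present but empty
    is ''. The path is always a string."""
    m = _URI_REF.match(uri_ref)
    return (m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
```

Note how 'file:///x' yields an empty authority rather than None, and how 'mailto:user@example.org' simply has a path, with no separate "opaque part".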
If the URI reference is given as a Unicode string, then the encoding argument determines how percent-encoded components are interpreted, and the result will be a Unicode string. If the URI reference is a regular byte string, the encoding argument is ignored and the result will be a byte string in which percent-encoded octets have been converted to the bytes they represent. For example, the trailing path segment of u'file:///a/b/%E2%80%A2' will by default be converted to u'\u2022', because sequence E2 80 A2 represents character U+2022 in UTF-8. If the string were not Unicode, the trailing segment would become the 3-byte string '\xe2\x80\xa2'. The osname argument determines for what operating system the resulting path is appropriate. It defaults to os.name and is typically the value 'posix' on Unix systems (including Mac OS X and Cygwin), and 'nt' on Windows NT/2000/XP. This function is similar to urllib.url2pathname(), but is more featureful and produces better paths. If the given URI reference is not relative, its scheme component must be 'file', and an exception will be raised if it isn't. In accordance with RFC 3986, RFC 1738 and RFC 1630, an authority component that is the string 'localhost' will be treated the same as an empty authority. Dot segments ('.' or '..') in the path component are NOT collapsed. If the path component of the URI reference is relative and the attemptAbsolute flag is True (the default), then the resulting path will be made absolute by considering the path to be relative to the current working directory. There is no guarantee that such a result will be an accurate interpretation of the URI reference. attemptAbsolute has no effect if the result is not being produced for the local operating system. Fragment and query components of the URI reference are ignored. If osname is 'posix', the authority component must be empty or just 'localhost'. 
An exception will be raised otherwise, because there is no standard way of interpreting other authorities. Also, if '%2F' is in a path segment, it will be converted to r'\/' (a backslash-escaped forward slash). The caller may need to take additional steps to prevent this from being interpreted as if it were a path segment separator. If osname is 'nt', a drivespec is recognized as the first occurrence of a single letter (A-Z, case-insensitive) followed by '|' or ':', occurring as either the first segment of the path component, or (incorrectly) as the entire authority component. A UNC hostname is recognized as a non-empty, non-'localhost' authority component that has not been recognized as a drivespec, or as the second path segment if the first path segment is empty. If a UNC hostname is detected, the result will begin with '\\<hostname>\'. If a drivespec was detected also, the first path segment will be '$<driveletter>$'. If a drivespec was detected but a UNC hostname was not, then the result will begin with '<driveletter>:'. Windows examples: 'file:x/y/z' => r'x\y\z'; 'file:/x/y/z' (not recommended) => r'\x\y\z'; 'file:///x/y/z' => r'\x\y\z'; 'file:///c:/x/y/z' => r'C:\x\y\z'; 'file:///c|/x/y/z' => r'C:\x\y\z'; 'file:///c:/x:/y/z' => r'C:\x:\y\z' (bad path, valid interpretation); 'file://c:/x/y/z' (not recommended) => r'C:\x\y\z'; 'file://host/share/x/y/z' => r'\\host\share\x\y\z'; 'file:////host/share/x/y/z' => r'\\host\share\x\y\z'; 'file://host/x:/y/z' => r'\\host\x:\y\z' (bad path, valid interpretation); 'file://localhost/x/y/z' => r'\x\y\z'; 'file://localhost/c:/x/y/z' => r'C:\x\y\z'; 'file:///C:%5Cx%5Cy%5Cz' (not recommended) => r'C:\x\y\z'
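The 'posix' rules above (scheme must be 'file', authority must be empty or 'localhost', query and fragment ignored, '%2F' kept distinguishable from a separator) can be sketched as follows. The function name is hypothetical, ValueError stands in for the module's exception, and neither the 'nt' rules nor attemptAbsolute are covered.

```python
from urllib.parse import unquote, urlsplit

def uri_to_posix_path_sketch(uri):
    """Hypothetical helper: convert a 'file' URI reference to a POSIX path."""
    parts = urlsplit(uri)  # query and fragment are parsed, then ignored
    if parts.scheme and parts.scheme != 'file':
        raise ValueError('only the file scheme is supported: %r' % uri)
    if parts.netloc not in ('', 'localhost'):
        raise ValueError('unsupported authority: %r' % parts.netloc)
    # Decode each segment separately so that a decoded '%2F' can be
    # escaped instead of being mistaken for a segment separator.
    segments = parts.path.split('/')
    return '/'.join(unquote(seg).replace('/', '\\/') for seg in segments)
```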
Simply calls MakeUrllibSafe() on the given URL and passes the result and all other args to urllib2.urlopen().
For example, the URN "urn:publicid:%2B:IDN+example.org:DTD+XML+Bookmarks+1.0:EN:XML" will be converted to the public identifier "+//IDN example.org//DTD XML Bookmarks 1.0//EN//XML". Raises a UriException if the given URN cannot be converted. Query and fragment components, if present, are ignored.
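The conversion in the example above can be reproduced with a short sketch. It assumes the RFC 3151 transcription rules ('+' for a space, ':' for '//', ';' for '::', and percent-encoding for the literal characters); `urn_to_public_id_sketch` is a hypothetical name and ValueError stands in for UriException.

```python
import re
from urllib.parse import unquote

def urn_to_public_id_sketch(urn):
    """Hypothetical helper: convert an RFC 3151 URN to a public identifier."""
    prefix = 'urn:publicid:'
    if urn[:len(prefix)].lower() != prefix:
        raise ValueError('not a publicid URN: %r' % urn)
    # Drop any query or fragment component.
    body = re.split(r'[?#]', urn[len(prefix):], maxsplit=1)[0]
    # Undo the unescaped transcriptions first ...
    body = body.replace('+', ' ').replace(':', '//').replace(';', '::')
    # ... then decode the percent-encoded literals such as %2B and %3A.
    return unquote(body)
```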