URI usage in XSLT. A mess

URI related

1. Uri specifications for xslt
2. file access
3. URI form when file is used
4. Referring to files in a URL with Xalan
5. Encoding URLs containing spaces.
6. Specifying a drive in an URI under Windows
7. Escaping URI's
8. Uri specification
9. utf-8 in uri's
10. Use of \ in url

I've run a few variations through a number of processors. The results are here. DaveP.

An xml-dev posting on this topic pointed me to this url which seemingly is still a draft. As Rick says:

which was released at the same time as RFC3986.

As it says, the issue of the exact syntax of the file: scheme is unresolved.
I think all we can do is know the range of tricks. Whenever I teach an
XML-related course, I always try to make the point "The thing you type in
the address box of your browser is not a URL" and let them know about this
kind of problem with file:, because sooner or later they will probably have
to face it.

1.

Uri specifications for xslt

DaveP

I tested a number of options with a number of processors (see above); the results are here.

Conrad mailed me with another informal usage:

There is another variant, this one from Netscape. It allows file:///C|/

i.e. pipe (|) instead of colon (:) after the drive.

2.

file access

Jean-Baptiste Quenot

AFAIK, the convention is not « file:/// » but « file:// ». The third slash is only required on UNIX systems, in order to denote the root directory.

Marc-Aurèle DARCHE fills in the picture.

For more information on this, I suggest referring to the RFC "Uniform Resource Identifiers (URI): Generic Syntax" (RFC 2396) or the RFC "Uniform Resource Locators (URL)" (RFC 1738).

Here are some chosen pieces :

The URI syntax is dependent upon the scheme. In general, absolute URI are written as follows:

<scheme>:<scheme-specific-part>

An absolute URI contains the name of the scheme being used (<scheme>) followed by a colon (":") and then a string (the <scheme-specific-part>) whose *interpretation depends on the scheme*.

The URI syntax does not require that the scheme-specific-part have any general structure or set of semantics which is common among all URI. However, a subset of URI do share a common syntax for representing hierarchical relationships within the namespace. This "generic URI" syntax consists of a sequence of four main components:

<scheme>://<authority><path>?<query>

A file URL takes the form:

file://<host>/<path>

where <host> is the fully qualified domain name of the system on which the <path> is accessible, and <path> is a hierarchical directory path of the form <directory>/<directory>/.../<name>.

For example, a VMS file

DISK$USER:[MY.NOTES]NOTE123456.TXT

might become

<URL:file://vms.host.edu/disk$user/my/notes/note12345.txt>

As a special case, <host> can be the string "localhost" or the empty string; this is interpreted as `the machine from which the URL is being interpreted'.

BNF

fileurl        = "file://" [ host | "localhost" ] "/" fpath

So the following are correct on Unix:

file://localhost/home/user/data/fileResource
file:///home/user/data/fileResource

And I can only guess, and cannot test, for Microsoft Windows:

file://localhost/C:/somedir/anotherone/fileResource

David Carlisle and Peter Flynn expand on this:

Note that file:///D:\test.xml and file:/D:\test.xsl both appear to work on Microsoft platforms but both of those are incorrect. IE actually corrects that user error for you and changes \ to / but it is not obliged to.

So if you use \ you have no reason to expect relative links to work. If the system changes \ to / for you as a silent error recovery, that's possibly nice of it, but you shouldn't rely on it.

  file://mikesmachine.iclway.co.uk/usr/docs/x.xml

is OK, or, if you don't want to name the machine:

file://localhost/usr/docs/x.xml

or

file:///usr/docs/x.xml

For Linux, with setup of xt (jdk1.3), RedHat 6.2, only file:///name.ext works. Cocoon (xalan,xerces), RedHat 6.2, ditto.

3.

URI form when file is used

Tom Passin



> > file:///c:\servers\mserver\ojamc2vpt1_part.xsd
> > file:c:\servers\mserver\ojamc2vpt1_part.xsd
> > file:c:/servers/mserver/ojamc2vpt1_part.xsd
> >
> > There is some ambiguity in the RFC about the exact form of file: URLs,
> > leading to different interpretations, and different processors can require
> > different forms.  The form you used works on some processors, but I forget
> > what works with which.
>
> It should be noted that all of the forms above address
> different locations. The component separator for
> file: URLs is "/", not "\".

It is if the URL is considered transparent, but if it is to be considered opaque then the server gets to decide how to decode it, and the use of "/" is not required. The RFC does not say whether the file: scheme is supposed to use transparent or opaque URLs, and there have been several different interpretations as well as outright errors (like allowing file:c:) in actual implementations. Sometimes an application will accept one thing from the command line and another in the document() function, too. That is why you have to try various forms if you think you might be having this problem.

> Which actually doesn't
> matter if the whole stuff is passed down to the OS
> routines, but still...

But you can not always be sure where the URL gets parsed, and it is not always by the OS.

> BTW from within applications, Windows accepts both
> "/" and "\" as directory separators.

Windows Explorer does for local files, but try to use a network host name on an NT network with forward slashes and you will find it does not work. For example, try to open a file using the standard Windows file open dialog on a computer named "computer_a" using "\\computer_a\" and it will work, but not if you use "//computer_a/". Windows is not consistent about forward vs. reverse slashes.


> Furthermore the last two forms are formally relative
> URLs (despite being absolute file system paths) and
> could be accidentally passed to an URI resolver.
>

Some libraries insist on that form anyway.

> Bottom line: the original form is preferable in
> any case and should also work for every major XSL
> processor.
>

"Should" and "actually does" are not always the same, especially in the case of "file:" URLs. For example, Sablotron 0.70 requires "file://", not file:///, although "file://" is clearly an error by any interpretation. I think that "file:\\\" is rarely required but it is well to try it if all else fails to work. Actually, there is a good degree of tolerance for variant forms these days, especially among the major processors (probably, as you say, because the OS accepts several variations), but depending on the library in use you may get unlucky.

4.

Referring to files in a URL with Xalan

Norm Walsh

> The question is, how do I refer to local file:// type URLs when using Xalan?

With file:///path/to/file.
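
As a minimal sketch of how such a URL might be used from inside a stylesheet, the fragment below passes a file: URL to document(). The path /path/to/lookup.xml and the element name item are hypothetical; substitute a file that exists on your machine. Whether a given processor accepts this exact form is, as the rest of this page shows, something to test rather than assume.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical location: an absolute file: URL with an empty host part,
       hence the three slashes. -->
  <xsl:variable name="lookup"
      select="document('file:///path/to/lookup.xml')"/>

  <xsl:template match="/">
    <!-- Count the (hypothetical) item elements found in the external file. -->
    <xsl:value-of select="count($lookup//item)"/>
  </xsl:template>

</xsl:stylesheet>
```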

5.

Encoding URLs containing spaces.

Mike Brown


> In my demo.xsl, I have the following fragment
> <a href="getDemo.xml?Id={@Account}">click</a>
> 
> where @Account is holding onto 'Hello World'
	

The output HTML contains

   href="getDemo.xml?Id=Hello World"

And you should note that the attribute value here is not a legal URI. Raw space characters are not allowed in URIs.

> When I use IE 5, Netscape 6, or Opera 5 this is what
> is sent via the http call:
> 
> http://localhost/demo/getDemo.xml?Id=Hello%20World

Then IE 5, Netscape 6 and Opera 5 are being very nice to translate your unescaped space for you. They are under no obligation to do that.

> When I use Netscape 4.7 this is what is sent via the http call
> 
> http://localhost/demo/getDemo.xml?Id=Hello World

Netscape 4.7 is not doing anything wrong, although one could argue that since an href value must be a URI, any disallowed characters in the value should be considered part of the URI and should be escaped so as not to have this situation. But there is no requirement that this actually happen.

You should just translate the spaces in the first place. If you are sure that your server will accept '+' instead of ' ', you might try using the translate() function:

<a href="getDemo.xml?Id={translate(@Account,' ','+')}">click</a>

or, the way I would do it, would be to do a string replacement of ' ' with '%20':

<a>
  <xsl:attribute name="href">
    <xsl:text>getDemo.xml?Id=</xsl:text>
    <xsl:call-template name="replace">
      <xsl:with-param name="stringIn" select="@Account"/>
      <xsl:with-param name="charsIn" select="' '"/>
      <xsl:with-param name="charsOut" select="'%20'"/>
    </xsl:call-template>
  </xsl:attribute>
  <xsl:text>click</xsl:text>
</a>

The named template is below.

<xsl:template name="replace">
  <xsl:param name="stringIn"/>
  <xsl:param name="charsIn"/>
  <xsl:param name="charsOut"/>
  <xsl:choose>
    <xsl:when test="contains($stringIn,$charsIn)">
      <!-- Emit the text before the first match, followed by the replacement -->
      <xsl:value-of
        select="concat(substring-before($stringIn,$charsIn),$charsOut)"/>
      <!-- Recurse on the remainder; the name must match this template's name -->
      <xsl:call-template name="replace">
        <xsl:with-param name="stringIn"
          select="substring-after($stringIn,$charsIn)"/>
        <xsl:with-param name="charsIn" select="$charsIn"/>
        <xsl:with-param name="charsOut" select="$charsOut"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <!-- No more matches: copy the rest unchanged -->
      <xsl:value-of select="$stringIn"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
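
If an XSLT 2.0 processor is available, the recursive replacement may not be needed at all. The sketch below assumes such a processor and uses the XPath 2.0 function encode-for-uri(); the match pattern *[@Account] is an assumption about the shape of the input, not something taken from the original question.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- Assumes the context element carries an Account attribute,
       as in the question above. -->
  <xsl:template match="*[@Account]">
    <!-- encode-for-uri() percent-encodes everything except unreserved
         characters, so 'Hello World' becomes 'Hello%20World'. -->
    <a href="getDemo.xml?Id={encode-for-uri(@Account)}">click</a>
  </xsl:template>

</xsl:stylesheet>
```

Because the function escapes reserved characters such as / and ? as well, it should be applied to the parameter value only, not to the whole URI.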

6.

Specifying a drive in an URI under Windows

Steve Muench


>Aren't there any standard URI syntax supporting Windows file system?


Try 
file:///D:/foo/bar/baz.xml

With the three slashes.
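
When the starting point is a native Windows path, one way to arrive at that form in XSLT 1.0 is a translate() call. This is only a sketch: the parameter name win-path is made up for illustration, and spaces or other characters that need %-escaping are not handled.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter holding a native Windows path. -->
  <xsl:param name="win-path" select="'D:\foo\bar\baz.xml'"/>

  <xsl:template match="/">
    <!-- Swap every backslash for a forward slash, then prepend file:///.
         Characters that need %-escaping (spaces etc.) are NOT handled. -->
    <xsl:value-of
        select="concat('file:///', translate($win-path, '\', '/'))"/>
  </xsl:template>

</xsl:stylesheet>
```

With the default parameter value this yields file:///D:/foo/bar/baz.xml, the form suggested above.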

7.

Escaping URI's

Wesley W. Terpstra

The attached XSL code should work with UTF-8 in URIs, since according to the RFC both the XSLT engine and the browser will UTF-8 encode and then hexify any non-ASCII characters they see in a URI.

However, this works only for URIs; one can't use translate(..., '%', '=') on the output and expect it to work (which is what I wanted, since I want to output email headers with UTF-8 encoding).

For this case, I automatically detect whether a correct escape-uri(...) function is present in the XSLT engine. If so, we use that, and the translate trick then works, since escape-uri will UTF-8 encode and hexify the passed string.

If there is no escape-uri to deal with UTF-8, it defaults to outputting ?s instead of the real characters. This can be overridden to use the URI hack outlined above.

This code obeys the RFC to the letter. (I even deal with the nasty % case)

The entry point to the xsl code is:

<xsl:call-template name="my-escape-uri">
 <xsl:with-param name="str" select="'the url here'"/>
 <xsl:with-param name="allow-utf8" select="true()"/>
</xsl:call-template>

--------------
escape-uri.xsl
--------------

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<xsl:stylesheet 
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns="http://www.w3.org/1999/xhtml"
    version="1.0">

<xsl:output method="html" indent="no" encoding="UTF-8"
    doctype-system="http://www.w3.org/TR/html4/strict.dtd"
    doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN"/>

<!-- Escape URIs -->

<xsl:variable name="uri-input">The@dog-z went/_&#127;_%ab%zu%c</xsl:variable>
<xsl:variable name="uri-output">The%40dog-z%20went%2F_%7F_%ab%25zu%25c</xsl:variable>
<xsl:variable name="have-escape-uri"
    select="escape-uri($uri-input, true()) = $uri-output"/>

<!-- According to the RFC, non-ascii chars will be utf-8 encoded and escaped
     with %s by the xslt-engine when in a 'uri' attribute or by the browser
     if the xslt-engine doesn't. This is ok, but not enough since we still
     won't have working RFC822 (email) Froms! since they need =s. 
     However, as there is nothing I can do about this, I will just hope for
     the best if an xslt engine doesn't have escape-uri. 

note the linebreaks \ at end of line only! (DaveP) are there to
enable a moderate display. Concatenate them to one line.


-->
<xsl:variable name="ascii-charset">
!&quot;#$%&amp;&apos;()*+,-./0123456789:;&lt;=&gt;?\
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~&#127;
</xsl:variable>
<xsl:variable
name="uri-ok">-_.!~*&apos;()0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ\
  abcdefghijklmnopqrstuvwxyz</xsl:variable>
<xsl:variable name="hex">0123456789ABCDEFabcdef</xsl:variable>

<xsl:template name="do-escape-uri">
 <xsl:param name="str"/>
 <xsl:param name="allow-utf8"/>

 <xsl:if test="$str">
  <xsl:variable name="first-char" select="substring($str,1,1)"/>
  <xsl:choose>
   <xsl:when test="$first-char = '%' 
	  and string-length($str) &gt;= 3 and
	  contains($hex, substring($str,2,1)) and 
	  contains($hex, substring($str,3,1))">
    <!-- The percent char is ok IF it followed by a valid hex pair -->
    <xsl:value-of select="$first-char"/>
   </xsl:when>
   <xsl:when test="contains($uri-ok, $first-char)">
    <!-- This char is ok inside urls -->
    <xsl:value-of select="$first-char"/>
   </xsl:when>
   <xsl:when test="not(contains($ascii-charset, $first-char))">
    <!-- Non-ascii output raw based on utf8 allowed or not -->
    <xsl:choose>
     <xsl:when test="$allow-utf8">
      <xsl:value-of select="$first-char"/>
     </xsl:when>
     <xsl:otherwise>
      <xsl:text>%3F</xsl:text>
     </xsl:otherwise>
    </xsl:choose>
   </xsl:when>
   <xsl:otherwise>
    <!-- URL escape this char -->
    <xsl:variable name="ascii-value"
select="string-length(substring-before($ascii-charset,$first-char)) + 32"/>
    <xsl:text>%</xsl:text>
    <xsl:value-of 
	  select="substring($hex,floor($ascii-value div 16) + 1,1)"/>
    <xsl:value-of select="substring($hex,$ascii-value mod 16 + 1,1)"/>
   </xsl:otherwise>
  </xsl:choose>
  
  <xsl:call-template name="do-escape-uri">
   <xsl:with-param name="str" select="substring($str,2)"/>
   <xsl:with-param name="allow-utf8" select="$allow-utf8"/>
  </xsl:call-template>
 </xsl:if>
</xsl:template>

<xsl:template name="my-escape-uri">
 <xsl:param name="str"/>
 <xsl:param name="allow-utf8"/>
 
 <xsl:choose>
  <xsl:when test="$have-escape-uri">
   <xsl:value-of select="escape-uri($str, true())"/>
  </xsl:when>
  <xsl:otherwise>
   <xsl:call-template name="do-escape-uri">
    <xsl:with-param name="str" select="$str"/>
    <xsl:with-param name="allow-utf8" select="$allow-utf8"/>
   </xsl:call-template>
  </xsl:otherwise>
 </xsl:choose>
</xsl:template>

<xsl:template match="/">
 <html>
  <xsl:call-template name="my-escape-uri">
   <xsl:with-param name="str" 
	  select="'The@dog-zwent/_&#255;&#127;_%ab%zu%c'"/>
  </xsl:call-template>
 </html>
</xsl:template>

</xsl:stylesheet>

8.

Uri specification

Mike Brown


J.Pietschmann wrote:
> The component separator for file: URLs is "/", not "\".

Thomas covered this pretty thoroughly, and contributed significantly to related threads on other lists, but I want to reiterate the point that the statement above is *not* a rule.

Everything after the scheme in a file: URL is OS-dependent by definition. (Thanks, Netscape, for bestowing yet another abomination upon us). The format is:

"file:" + an OS-dependent path, properly escaped

Ultimately, there are *no* assumptions you can make about what comes after the scheme, if you don't know the OS the URL is associated with.

Examples of paths on different OSes that make life difficult:

Mac (before OS X)         Windows/DOS                     UNIX
=======================   =============================   =================
(no equivalent)           \abs\path\to\file               /abs/path/to/file
drive:abs:path:to:file    DRIVE:\abs\path\to\file         (no equivalent)
:rel:path:to:file         rel\path\to\file                rel/path/to/file
::foo:rel:path:to:file    ..\foo\rel\path\to\file         ../foo/etc
:::foo:rel:path:to:file   ..\..\foo\rel\path\to\file      ../../foo/etc
(no equivalent)           \\Host\Share\abs\path\to\file   (no equivalent*)

  * Well, sorta. Things like NFS, SMBFS, etc. complicate matters a bit

The URI resolver shared by Windows Explorer and Internet Explorer is by far the most forgiving. Its ability to handle pretty much anything you throw at it should not be considered evidence of the equivalence of different kinds of path components, or of there being a canonical format for the paths in file: URLs.

9.

utf-8 in uri's

Mike Brown




> the problem is that I passed as 
> parameter one two-byte 
> character and each byte of those two was transformed again 
> into two-byte 
> characters giving in result four bytes i.e., two two-byte 
> characters.

> Original character was %C5%82 and the result was &Aring; - 
> one character and 
> &#130; - second character :(

Everyone else seems to have missed your point. You are running into an issue with an underspecified part of the URI, HTTP and HTML specs: there is no standard mechanism for declaring what encoding is being used when representing non-ASCII characters (x80 and above) in the %-escaped format used in URIs and HTML form data submissions.

Tomcat interprets %C5%82 in the HTTP request as bytes C5 82, and exposes them through the Java/JSP API as 2 chars in a String according to an assumed (and probably wrong) iso-8859-1 encoding.

On the receiving end, you must convert these chars back into bytes, assuming iso-8859-1, and then convert them to a String again, this time assuming UTF-8. I did this in JSPs with WebLogic a while back, and it was pretty straightforward. I'm not sure how it works with your particular Tomcat/Cocoon setup, though.

10.

Use of \ in url

David Carlisle



> '\'  by itself is not prohibited in URLs, is it?

No, but it doesn't mean what people think it means (it has the same status as "a").


> Is "http://example.com/\data\file.xsl" an invalid URL? 

No, but it has a _single_ path component called "\data\file.xsl" so if that file has an <xsl:include href="foo.xsl"/> then foo.xsl is a relative uri that corresponds to "http://example.com/foo.xsl" which probably is not what was intended.

If the same file is served from "http://example.com/data/file.xsl" then the relative foo.xsl uri will resolve to "http://example.com/data/foo.xsl"

Note that even if the server tries to be kind and silently maps \ to / (as it may, since it is free to map URIs to its file system in any way it likes), this will still fail, because it is the _client_ that resolves relative URIs and requests an absolute URI from the server. Given that base URI, the client is mandated to ask for http://example.com/foo.xsl.
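
The resolution rule can be observed from within XSLT itself. The sketch below assumes an XSLT 2.0 processor and calls the XPath 2.0 function resolve-uri() on two base URIs. The first base uses the %5C-escaped form of the backslash, on the assumption that some processors would reject a raw backslash as an invalid URI.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Base whose path is ONE component containing escaped backslashes. -->
    <xsl:value-of select="resolve-uri('foo.xsl',
        'http://example.com/%5Cdata%5Cfile.xsl')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Base whose path uses real "/" separators. -->
    <xsl:value-of select="resolve-uri('foo.xsl',
        'http://example.com/data/file.xsl')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

A conforming processor should print http://example.com/foo.xsl for the first call and http://example.com/data/foo.xsl for the second, which is the difference described above.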


> Putting it another way, I think it is legal to write

> <xsl:import href="module1\stylesheet1.xsl"/>

It's legal, but the base URI of the imported file might not be what you expect.

You can see this in action.

If you try http://www.nag.co.uk/numeric/fl/manual/html/FLlibrarymanual.asp

and click on the link to "Forword" you'll follow a relative link to a pdf file with a Forword to the NAG library (all these docs are generated with xslt by the way, if you wonder why I hang out here:-)

If you try http://www.nag.co.uk/numeric\fl\manual\html\FLlibrarymanual.asp

Then it depends what you try it with.

IE silently corrects your input to the first form, and everything works as before.

Mozilla (also on windows) changes the location bar to http://www.nag.co.uk/numeric%5Cfl%5Cmanual%5Chtml%5CFLlibrarymanual.asp

Our (Microsoft) server accepts that and serves the same page, but now none of the relative links work, and all give 404's

The fact that IE changes the user input in its location box is a matter for IE; this is a user input issue. But once the string has been entered and is being passed between different APIs, it's not at all clear whether that kind of user-error correction is really allowed, and the Mozilla behaviour will be far more typical.