XSLT document order, ordering, sorting

Document Order

1. Understanding document order, node sets and the ancestor axis.
2. Document order
3. What does reverse document order mean

1.

Understanding document order, node sets and the ancestor axis.

Various

Contributors: David Carlisle, Jeni Tennison, Mike Kay, Oliver Becker, Wendell Piez, Ken Holman.

This all started off with a list question asking How could I get the nearest ancestor with a certain attribute, say "attr"? The very simplistic answer is:

(ancestor-or-self::*/@attr)[last()] or ancestor-or-self::*[@attr][1]/@attr

Typically you want to do this because you have "inherited attributes" eg if you want to know the current language, you want to know the value of the nearest xml:lang attribute which is

(ancestor-or-self::*/@xml:lang)[last()]

similarly if you are reading an FO file and want to know current effective value of any FO property.

In order to collate the whole host of answers and comments on this question, I wrote a test harness and stylesheet.

Test Data: I assumed that the sought attribute was a property of an ancestor, rather than the 'home' element itself.

XML file

<?xml version='1.0'?>
<el>
    <el attr="x1">
      <el attr="y1">
	<el>
	  <el attr="x2">
	    <el attr="This one"><!-- This one is sought -->
	      <el>
		<el attr="y">Y2</el>
		<el>
		  <home>Base template match</home>
		</el>
	      </el>
	    </el>
	  </el>
	</el>
      </el>
    </el>
  </el>



XSL Stylesheet:

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="1.0">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="/">
    <xsl:apply-templates select="//home"/>
  </xsl:template>


  <xsl:template match="//home">
1    <xsl:value-of select="ancestor-or-self::*[@attr][1]/@attr"/>
2  <xsl:for-each select="ancestor::*[@attr]">
    <xsl:value-of select="@attr"/>
  </xsl:for-each>
3. <xsl:value-of select="ancestor::*[@attr][1]/@attr"/>
4. <xsl:variable name="tmp" select="ancestor::*[@attr]/@attr"/>
   <xsl:value-of select="$tmp"/>
5. <xsl:value-of select="(ancestor::*/@attr)[last()]"/>
6. <xsl:value-of select="ancestor::*[@attr][1]/@attr"/>
7. <xsl:value-of select="ancestor::*/@attr[last()]"/>
  </xsl:template>

  <xsl:template match="*"/>
</xsl:stylesheet>




Each one I tried, I numbered. The following numbers relate to that numbering.

1 <xsl:value-of select="ancestor-or-self::*[@attr][1]/@attr"/> Provides the right one.

2. ancestor::*[@attr] gets you a whole host of stuff..., the last node of which is the sought item!

3. <xsl:value-of select="ancestor::*[@attr][1]/text()"/> gets the right node

4. ancestor::*/@attr gets the first in document order!

5. (ancestor::*/@attr)[last()]

Actually provides the node we were looking for!

6. ancestor::*[@attr][1]/@attr works.

Xpath 4.2 is the reference here.

7. <xsl:value-of select="ancestor::*/@attr[last()]"/> Gives me the last in the reverse document order!

This 7th expression will return a set consisting of the last @attr attribute of each ancestor element. An element can only have one @attr attribute so saying you want its last one is pointless.

An axis returns a list of nodes, which is always in either forwards or reverse document order. This affects the meaning of position() and last() when applied to a predicate used in a Step based on that axis. The final result of a path expression, however, is always a set of nodes, with no intrinsic order. Many operations on a node-set access the nodes in document order. This includes a filter expression, so:

(ancestor::*/@attr)[last()] will select the last in document order (i.e. the innermost available attribute). Which implies that example 5 is returning the correct attribute.

So what *is* the order of nodes returned by ancestor::*/@attr (without the grouping operator)?

XPath expressions select node sets. The distinguishing feature of a set (as opposed to a list) is that it is unordered.

(Most operations on that set take the nodes in document order, but that is a propoerty of the operation being applied, not of the set itself)

Regarding the importance of an axis:

The "order" of an axis only "rules" for the step that it is in, ie it affects the meaning of position() and last() in predicates that occur in that step. It doesn't affect the constructed node set.

The axis direction affects predicates in a step (in both patterns and expressions) and it doesn't affect anything else. It _never_ affects the resulting node set.

(A/B/C[last()])[position()=2]
or
<xsl:variable name="x" select="A/B/C[last()]"/>
<xsl:value-of select="$x[2]"/>
or
<xsl:for-each select="A/B/C[last()]">
 <xsl:if test="position()=2">

all select the 2nd node in document order irrespective of the axis used in C.

The only thing the axis order affects is the direction of that step ie the direction of predicates appearing between C and the next / (if there is one)

Although actually since you've already used [last()] in this step the node list in that step has at most one node so it makes no difference which way you count, but for the record

select="A/B/C[last()][1]">

selects the first (and only) node counding from the front or the back, depending on the axis direction of C.

When you use a predicate on a node set within a step (e.g. //*/@*[1] or foo[1]) then the node set is sorted according to the axis. If it's preceding::, preceding-sibling::, ancestor:: or ancestor-or-self:: then it's a reverse axis, otherwise it's a forward axis. If it's a reverse axis, then it's sorted into *reverse document order* before the predicate is assessed; if it's a forward axis, then it's sorted into *document order* before the predicate is assessed.

When processing a node-set note that:

If you use xsl:for-each or xsl:apply-templates then the nodes are always processed in document order.

Stating the obvious, a node-set is a set (and sets are by definition unordered). select="ancestor::*" produces the _set_ of ancestors, it isn't ordered at all. <xsl:variable name="x" select="ancestor::*"/> puts that set in a variable. <xsl:variable name= "x" select="$x[1]"/> the operation of applying the predicate [1] (ie [position()=1]takes the first node in document order. But that is a feature of the operation not of the node set $x.

Note this is not the same as applying a predicate within a step, which is the only place the axis order has any effect

<xsl:variable name="y" select="ancestor::*[1]"/>

selects the parent, unlike the $x[1] which selects the document element (ie the outer most ancestor)

A node set is always unordered until you do anything with it like iterate over it, get a value from it or whatever. Of course a node set is pretty useless unless you do something with it, so saying a node set is unordered doesn't particularly help :)

It mainly helps in defining | as set union, which is a standard operation on sets. If you wanted to define | on ordered lists you'd have to describe removing of duplicates and say which duplicate was removed, etc.

When you use a predicate on a node set that's already been generated, such as a node-set variable, a union or just by putting brackets around it (e.g. $foo[1], (//*/@*)[1] or (foo|bar)[1]) then the node set is always sorted into *document order* no matter what axes you use.

Since a predicate can be applied to a location path expression, as in the union of two other path expressions, that predicate is always evaluated in document order.

xsl:value-of and xsl:for-each deal with location path expressions (not location step expressions), therefore their evaluation is done in document order.

Mainly of course we are just arguing about words, for the fun of it, but if one was constructing ordered sets I would expect ancestor::* to construct the set of ancestors with reverse document ordering, but it doesn't. It just constructs the set of ancestors. Now given a set there is (in XSLT 1.x) only one possible ordering you can apply, namely document order, so whether that property is intrinsic or applied makes no practical difference, except for the way you describe things. I'd say that <xsl:variable name="x" select=" ancestor::*"/> constructs the set, and that <xsl:for-each select="$x"> steps through that set in document order. You'd presumably say that the variable constructed the node set in document order (as you want the sets to always be ordered), and then for-each does the obvious thing. Now clearly in an implementation a set is probably implemented as some kind of list and so is ordered, so probably the implementation follows your model, but that's an implementation detail, I've been a pure mathematician too long to worry about implementation details:-)

The XSL spec is always very careful to distinguish between "node set" and the "current node list" the latter, being a list and not a set, has an order, and that's what position() refers to. In some places the current node list emerges by magic out of a given node set, but it is still useful to keep the things clearly separated in your (or at least, my) mind.

An axis identifies an (ordered) list of nodes. The predicate associated with the axis is applied to the ordered list. The result of an XPath expression, however, is always a node-set, not a list. The node-set is unordered, but the nodes have an ordering, called document order. Many operations on node-sets process the nodes in document order. Is that clear now?

Drifting slightly off topic, regarding node-sets of attributes,

Attributes of an element aren't ordered in a defined way. The order is implementation dependent. You can't tell, what @*[1] gives. Document order or reverse document order is *only* important in location steps, or to be more specific, in the predicates of a step. If you have an expression that evaluates to a node-set and apply a predicate, then this predicate "filters the node-set with respect to the child axis." (see XPath 3.3)

To bring an even little bit more confusion:

1. Some nodes will have attributes that were not explicitly expressed in the text of the xml document, but were defined/defaulted in a DTD. What will be the "document order" for @* in this case?

DTD implied attributes are no different from explicit ones. The order is defined but implementation specific.

I.e. for a given element node . each time you do @*[2] in a run of the stylesheet you get the same attribute node, as th eattributes (defaulted or not) are placed in some order when the input tree is constructed. But you can't guarantee you get the same result if you run the stylesheet again. (Most likely you will if you use th esame input and same XSL engine, but that is not guaranteed)

Expanding to a nod-set generated from multiple documents:

Nodes from different documents are also in an arbitrary implementation-defined order.

<xsl:for-each select="document('doc1.xml')//* | document('doc2.xml')//*">
   ...
</xsl:for-each>

The attribute:: axis and the namespace:: axis are a bit weird because there is no real concept of 'document order' for attributes or namespace nodes. So, although @*[1] gives you the first attribute in document order, this could actually be any attribute on the element. It rarely makes sense to use a positional predicate with an attribute or namespace axis because attributes and namespaces have no inherent position in an element.

Kens personal view is worth noting:

I find it is more succinct to say: "Nodes from the attribute and namespace axes are always in an arbitrary implementation-defined order, while nodes from all other axes are in proximity order within location step expressions and in document order in location path expressions."

Note that attribute and namespace axes are *not* unordered (you can index into them), just that the order cannot be relied upon. There is an important nuance.

When you say that the order can't be relied on, do you mean:

(a) between two runs of a stylesheet?
(b) between two XSLT processors?

Technically, any of the above. If an implementation doesn't keep it consistent, that shouldn't have any effect on a properly written stylesheet (i.e. one that does not ever rely on the order).

It is totally up to the implementation, so the stylesheet shouldn't make any assumptions whatsoever.

The input to XSL is a tree in which there is a complete order applied to the set of nodes. This ordered input tree does not change during a run.

The "implementation defined" variability is in the way the ordered tree is built from a linear XML syntax. As Mike commented Saxon will give different results with different parsers, as the XSL engine will have effectively been handed different ordered trees. (In fact its probably the XSL code building the tree from sax events rather than really being passed an XPath tree model, but that is an implementation detail) Of course there need never have been any linear input at all. MSXML for example gets its input from a DOM object, that may have been manipulated in memory with attributes added/changed/deleted etc. Again the end result has to be ordered somehow before XSL starts work but it isn't specified how that ordering is done. Once MSXSL gets that tree though, it is fixed for the duration of the stylesheet.

2.

Document order

Ken Holman

The ancestor, ancestor-or-self, preceding, and preceding-sibling axes contain nodes in reverse document order. How is it possible iterate over the nodes in that order when xsl:for-each processes nodes in document order and xsl:sort sorts on string value, not document order?

By "walking" the source tree along the axis in the direction of the first member of the context node list, rather than dealing with the context node list as a whole.

>How can I do something like the following that will order the selected
>nodes in reverse document order?
>
>   <xsl:for-each select="ancestor::node()">
>     <p>Element: <xsl:value-of select="name()"/></p>
>   </xsl:for-each>

My example below illustrates how.

xml file:

<?xml version="1.0"?>
<test>
<a>This is a
<b>This is b
<c>This is c
<d>This is d
<e>This is e</e>
</d>
</c>
</b>
</a>
</test>

Stylesheet

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Tranform"     
     version="1.0">

<xsl:template match="/">                    <!--root rule-->
    <xsl:apply-templates mode="start" select="//e"/>
</xsl:template>                          

<xsl:template mode="start" match="*">
    <xsl:text>&#xa;Document Order:</xsl:text>
    <xsl:for-each select="ancestor::*">
      <xsl:text>&#xa;Element: </xsl:text>
      <xsl:value-of select="name()"/>
    </xsl:for-each>

    <xsl:text>&#xa;&#xa;Axis Order:</xsl:text>
    <xsl:apply-templates select="ancestor::*[1]"/>
</xsl:template>                                                

<xsl:template match="*">
    <xsl:text>&#xa;Element: </xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:apply-templates select="ancestor::*[1]"/>
</xsl:template>

</xsl:stylesheet>

Output:

Document Order:
Element: test
Element: a
Element: b
Element: c
Element: d

Axis Order:
Element: d
Element: c
Element: b
Element: a
Element: test

Or, From Tony Graham!

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Tranform"     
    version="1.0">

<xsl:template match="/">      <!--root rule-->
    <xsl:apply-templates mode="start" select="//e"/>
</xsl:template>                          

<xsl:template mode="start" match="*">
    <xsl:text>&#xa;Document Order:</xsl:text>
    <xsl:for-each select="ancestor::*">
      <xsl:text>&#xa;Element: </xsl:text>
      <xsl:value-of select="name()"/>
    </xsl:for-each>

    <xsl:text>&#xa;&#xa;Axis Order:</xsl:text>
    <xsl:apply-templates 
           select="ancestor::node()[1]" mode="ancestors"/>
</xsl:template>                                                

<xsl:template match="/" mode="ancestors">
    <xsl:text>&#xa;Root</xsl:text>
</xsl:template>

<xsl:template match="*" mode="ancestors">
    <xsl:text>&#xa;Element: </xsl:text>
    <xsl:value-of select="name()"/>
    <xsl:apply-templates select="ancestor::node()[1]" 
                       mode="ancestors"/>
</xsl:template>

</xsl:stylesheet>

            

David Carlisle adds

There are two possible meanings for The order in which nodes are added to the result tree and I don't know which you mean.

One is the `spatial order' ie which node becomes the first child, which the second, etc. This ordering is (I think) totally determined in XSLT by the document order in the input and the templates used. Different processors only produce different results because the input nodes may be ordered differently (attributes, as much discussed, but also nodes from different document() calls are in system dependent order) but once the input is determined then the output for a given stylesheet is specified.

The ordering that I was trying to stress was different and implementation dependent is the `chronological' order in which nodes are added to the result tree. That is when the actual templates get fired. This may happen in any order, or all at the same time. For a pure XSL stylesheet the end result does not depend on this `processing order' as there are no side effects (that's the point about having no side effects) but if you use extensions then they typically do (unfortunately) and so then you may get very different results on different systems.

There is one xsl instruction that creates a side effect: xsl:message writing to the terminal lets you see the `chronological' order of template execution, and whether variables are evaluated at the point of definition or the point of use and several other things that may and do differ between systems. But note this doesn't affect the result tree.

3.

What does reverse document order mean

David Carlisle

Note that node sets are called sets (rather than lists) on purpose, not by accident. They are intrinsically unordered.

[1] in a step selects the first node in the direction specified by the axis used in the step, but anywhere else it refers to document order.

so

preceding-sibling::elt[1]
  selects the immediately preceding elt sibling,
but
(preceding-sibling::elt)[1]
  selects the first elt sibling in document order.

You did 
 <xsl:variable name="sibling-preceders" select="preceding-sibling::elt"/>
   <xsl:for-each select="$sibling-preceders">
   <xsl:value-of select="."/> <xsl:if test="position()&lt;last()"> 

which is like

(preceding-sibling::elt)[position() &lt; last()]

so position() and last() refer to document order here.

To see _why_ it has to be that way (other than because that was the will of the w3c)

consider

 <xsl:variable name="sibling-preceders" 
    select="preceding-sibling::elt|../*[@id]"/>

now the variable contains all the preceding elts (which were "collected backwards" and all sibling elements with an id attribute (which were collected forwards) note these two sets overlap but | is set union, an elt with an id is selected by both clauses, but only appears in the node set once. So a constructed node set is justa set of nodes, each individual node doesn't "remember" what axis was used to select it.

So if you caome later to apply a position() construct to the node set, at that point the set is considered in document order.