Best Practice

1. Best Practice
2. Stylesheet optimisation
3. Named templates
4. Designing for multiple media
5. Predicates
6. XSLT Modularisation experience
7. Keys
8. Standards checker for XSLT

1.

Best Practice

Mark Giacobbe

I don't know what applications you have in mind, but if you're using XSL to produce HTML, the following conclusions may be of interest. They are based on recent tests aimed solely at speeding up XSLT transformation time on some rather complex XSL files, using Apache Cocoon with Xalan 1.03 and Xerces 1.04. Email me if you are interested in the technical specs.

1. For any "static" HTML portions of the page (such as headers, footers and nav bars), it's definitely more efficient to store the snippets as external XML files and copy them to the output tree using xsl:copy-of and the document() function (sketched after point 2), rather than using a named template and xsl:import.

2. (Rather obvious) If you can help it, don't xsl:import any more XSL than you need. For example, we had a 43kb file, library.xsl, containing various templates for often-used functions: formatting dates in different ways, creating month/day/year drop-down lists, and so on. Quite naturally, when we broke this file up into 8 smaller ones, it was faster (up to a point) to import just the required portions rather than the whole thing. However, it was faster to process the entire 43kb file, even with its unneeded templates, than to process all 8 atomized files.
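
Ed: a minimal sketch of both points, with hypothetical file names. For point 1, a static nav bar stored as a well-formed snippet nav.xml can be copied straight into the result tree:

<!-- copies the contents of nav.xml into the output -->
<xsl:copy-of select="document('nav.xml')" />

For point 2, rather than importing the whole library:

<xsl:import href="library.xsl"/>

import only the modules the stylesheet actually needs:

<xsl:import href="format-dates.xsl"/>
<xsl:import href="date-drop-downs.xsl"/>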

2.

Stylesheet optimisation

Mike Kay

Ed: I guess Mike is speaking for Saxon only.

The ones that are relevant here are, I suppose, those that find an algorithm for evaluating an expression that has lower complexity than the naive algorithm.

Some examples:

$x < $y naively requires X*Y comparisons where X and Y are the cardinalities of the two node-sets. Saxon evaluates it as min($x) < max($y) which requires only O(X+Y) operations. It would be possible to optimise $x[not($x > .)] using the same trick, but Saxon currently doesn't.

xsl:for-each with no xsl:sort naively requires to sort the selected nodes into document order which would take O(n log n) operations. Saxon goes to considerable pains to detect cases where the nodes are already naturally sorted, reducing this to O(n).

$x[1] naively requires a comparison on each element of the list to see if its position is equal to 1 (so it would be O(X)). Saxon just looks at the first element, which is O(1).

not($x), where $x is a node-set, naively evaluates the node-set (which is typically O(X), or even more naively O(X log X) if you were to eliminate duplicates eagerly) and then tests whether it is empty (which can be done in O(1)). Saxon 5.5.1 will usually avoid evaluating the node-set and eliminating duplicates, but not always; it depends on how the node-set is defined.

<xsl:number/> naively requires counting of the nodes that precede a given node, which requires O(n) operations; therefore numbering a sequence of n nodes requires O(n^2) operations. Saxon in some cases (but not all) remembers the last node and its number, so that numbering a sequence of n nodes becomes O(n).
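
Ed: the counting is triggered by constructs like the following (a minimal sketch; the element name "item" is hypothetical). Each item is numbered by counting its preceding siblings of the same name, which is exactly the O(n) work per node described above:

<xsl:template match="item">
  <xsl:number/><xsl:text>. </xsl:text>
  <xsl:value-of select="."/>
</xsl:template>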

I'm afraid that's a very incomplete list, but perhaps it gives a flavour.

Further to this, Jeni Tennison responds

It's particularly enlightening for me to see that a higher-complexity algorithm is not necessarily worse in general - it's always a matter of balancing different criteria for a particular problem.

Indeed, I think I'm right in saying that you can have two algorithms that do the same thing in different times but with the same complexity, so Mike's point that:

  num[not(../num &gt; .)][1]

is still O(n^2) is a comment about how the run-time of the XPath will increase as more nums are added: it says nothing about how much time each step takes, nor about the fact that it might stop halfway through.

This is one of those conceptual revisions that will take a little time to sink in for me. I'm still struggling to see why:

<xsl:template match="in" mode="find-max">
  <xsl:variable name="greater"
                select="following-sibling::in[. &gt; current()][1]" />
  <xsl:choose>
    <xsl:when test="$greater">
      <xsl:apply-templates select="$greater" mode="find-max" />
    </xsl:when>
    <xsl:otherwise><xsl:value-of select="." /></xsl:otherwise>
  </xsl:choose>
</xsl:template>

is O(n^2) while:

<xsl:template match="in" mode="find-max">
  <xsl:variable name="max-of-rest">
    <xsl:apply-templates select="following-sibling::in[1]"
                         mode="find-max" />
  </xsl:variable>
  <xsl:choose>
    <xsl:when test=". &gt; $max-of-rest or not(string($max-of-rest))">
      <xsl:value-of select="." />
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$max-of-rest" />
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

is O(n).

You both recommend looking at a book on algorithms: do you have any good ones that you recommend particularly?

To which David Carlisle responds:


> You both recommend looking at a book on algorithms: do you have any
> good ones that you recommend particularly?

Knuth (The Art of Computer Programming) is the bible, but I'm not sure you need a bible. It's been a while since I worked in a CS department; someone must be able to suggest a good entry-level textbook (or, cheaper, a web site).


> is still O(n^2) is a comment about how the run-time of the XPath will
> increase as more nums are added: it says nothing about how much time
> each step takes, nor about the fact that it might stop halfway through.

Yes O(n^2) says that after initial startup costs the algorithm will take 10000 times longer to process input that is 100 times bigger. It doesn't say anything about how big that startup cost is, or how much time each "step" takes.

So if you have a simpler algorithm for the same problem that is O(n^3), requires no initial set-up, and runs 1000 times faster per step (perhaps because it fits in memory rather than swapping to disk), then on all your real problems it may turn out to be quicker than the O(n^2) algorithm. But in the end, for big enough problems, the fact that increasing the input by a factor of 100 makes it take 1,000,000 times as long means that at some point the O(n^2) algorithm is going to win. And if you think O(n^3) grows fast, consider an exponential algorithm like factoring: you don't even want to _think_ about how big 10^n gets if you change n by a factor of 100.

>  I'm still struggling to see why:

Well, in the first one you first get all the elements bigger than the current element, then you recurse. Clearly this gets there in the end, but consider the following list, with 1 being the current node.

1 2 3 4 5 ....n

step 1:
compare 2 > 1, 3 > 1, .... n > 1
construct list
 2, 3, 4, 5 .... n

that's n steps done, and we have a list of length n-1 (just counting the number of comparisons, ignoring any time taken to construct node lists, bind variables, etc.)

now recurse:
  step 1
   compare 
        3 > 2 , ..... n > 2
   construct list
        3, .....n

   that's another n-1 steps, and we have a list of n-2 still to process.

Clearly this process is going to take n recursions, each recursion taking one step fewer than the last, so in toto

n + (n-1) + (n-2) + ......2 + 1

which is (n^2 + n)/2

n=1  (1^2 + 1)/2 = 1 = 1
n=2  (2^2 + 2)/2 = 3 = 2 + 1
n=3  (3^2 + 3)/2 = 6 = 3 + 2 + 1
etc

now (n^2 + n)/2 is O(n^2), because we don't care about the /2 - that just means "twice as fast", and we never specified the time units anyway - and we don't care about the + n, because for large n, n is nothing compared to n^2.

The other algorithm was

step 1
find maximum of rest of list (by recursion)
step 2
compare that with current element and use larger of the two.

This time there is again a recursion of depth n, but each recursion uses just 1 step (or 2, depending on how you count). So the total number of steps is

1 + 1 + .... + 1, n times, which is n, which is clearly O(n).

It's probably worth noting that your first algorithm has optimally bad behaviour when the list is already sorted. If the input had been sorted in the other order, it would of course have stopped a lot quicker, as the list constructed in the first recursion would have been empty. The "basic" notion of complexity studies worst-case behaviour. There is a lot that could be said about algorithms that work well on "typical" input, but that's a different subject.

Decoding Unix crypt-encoded passwords has a complexity that should mean the passwords are secure. But there is another algorithm, with _constant_ O(1) time, that works quite well in practice: get a dictionary of common names and commonly used passwords and try each of those. This algorithm (for a given dictionary) doesn't depend on the size of the input password. This kind of trick lookup is used in practice; compare Mike's description of xsl:number in Saxon. The "obvious" thing to do is count whatever you are supposed to be counting, but in his case the system already has the answer to hand, so as long as you hit the "common" case it takes effectively no time at all.

3.

Named templates

Jeni Tennison





> Jeni wrote:
>> .... personally, I think that the only time you should use named
>> templates is when the context node, position and size has no
>> influence on the result of the template -- in other words when it's
>> a pure function that computes its results based purely on the
>> parameters that are passed to it. Otherwise, my preference is to
>> use moded matching templates.
>
> I'm rather curious about the logic behind this Jeni. I can see the
> advantage of this class of use, in that it could then be used
> anywhere, in any stylesheet. Is that the rationale?
>
> Your option of using moded matching templates I'm less clear on.
> Lets say I have a named template that performs some function,
> then uses the parent of the current node for something; is this
> the use-case for the moded template?

Well, here is an example of what I think is a bad design:

<xsl:template match="person">
  <xsl:call-template name="getName" />
</xsl:template>

<xsl:template name="getName">
  <xsl:value-of select="givenName" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="familyName" />
</xsl:template>

The getName template uses the context node at the point the template was called in order to resolve the two paths ("givenName" and "familyName").

I think that this is bad design because there is no indication in the getName template about what kind of node the context node is. If I'm debugging the getName code I have to look at every call to the template to work out what the context node is in order to work out what the expressions within the template should be.

I think it is much better to use:

<xsl:template match="person">
  <xsl:apply-templates select="." mode="name" />
</xsl:template>

<xsl:template match="person" mode="name">
  <xsl:value-of select="givenName" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="familyName" />
</xsl:template>

Looking at the latter template tells me all I need to know about what the template creates. Nothing from outside the template (such as the context in which it is called) has any influence on the result. This makes it easy to debug.

It also makes the code more modular, both because you can copy the same template into other stylesheets and it will stand alone, as you point out, and because it's easier for other people to import and adapt. If <person> elements with a location attribute of "Japan" should display the family and given names the other way round, it's easy to add a separate template to do this:

<xsl:template match="person[@location = 'Japan']" mode="name">
  <xsl:value-of select="familyName" />
  <xsl:text> </xsl:text>
  <xsl:value-of select="givenName" />
</xsl:template>

I can see the arguments that for some templates, for example one that creates an XPath to a node, the identity of the context node doesn't really matter, but is just being used as a kind of built-in parameter to the template. Even in these circumstances, I think that the modularity argument makes me lean towards a moded template, but if that's not an issue then I think the template should include a parameter that defaults to the context node rather than using it implicitly. In other words, rather than:

<xsl:template name="createPath">
  <xsl:for-each select="ancestor-or-self::node()">
    <xsl:value-of select="name()" />
    ...
    <xsl:if test="position() != last()">/</xsl:if>
  </xsl:for-each>
</xsl:template>

I prefer:

<xsl:template name="createPath">
  <xsl:param name="node" select="." />
  <xsl:for-each select="$node/ancestor-or-self::node()">
    <xsl:value-of select="name()" />
    ...
    <xsl:if test="position() != last()">/</xsl:if>
  </xsl:for-each>
</xsl:template>

I prefer this because it makes the template more flexible in that it can be passed a particular node to use, which means that rather than having to change the context node with an xsl:for-each:

  <xsl:for-each select="..">
    <xsl:call-template name="createPath" />
  </xsl:for-each>

you can pass a parameter:

  <xsl:call-template name="createPath">
    <xsl:with-param name="node" select=".." />
  </xsl:call-template>

which I find more approachable.

I also prefer the explicit parameter because when I come back to the template I can immediately tell that the context node is used in the template and therefore the results will depend on where the template is called, something that isn't immediately apparent otherwise.

Does that make sense? Anyone think I'm missing something?

4.

Designing for multiple media

David Carlisle



  I'm wrestling with:
  whether I would go with multiple small xml files that I could compile 
  into one output for the print, but would match the html files 
  essentially
  OR
  one xml source file that matches the print, but can be output into 
  multiple files (via some element in the structure that separates) for 
  the web.

It is possible to have both:

have individual files for the bits, then rather than pulling in lots of files with XSLT's document() function, just have a single file that pulls in everything in an appropriate structure. That way you can process the parts as individual documents at some times and as one combined doc at others.

<!DOCTYPE book [
<!ENTITY preface SYSTEM "preface.xml">
<!ENTITY chapter1 SYSTEM "something.xml">
...
]><book>
  &preface;
  &chapter1;
...
</book>

or, if you prefer

<!DOCTYPE book [
<!ENTITY preface SYSTEM "preface.xml">
<!ENTITY sect1.1 SYSTEM "something.xml">
...
]>
<book>
  <frontmatter>
  &preface;
  </frontmatter>
<sect>
<head>chapter1 </head>
  &sect1.1;
  &sect1.2;
...
</sect>
<sect>
<head>chapter 2</head>
  &sect2.1;
  &sect2.2;
...
</sect>

</book>

  Anybody experienced one or both of these scenarios and have some 
  pointers?  I have seen some references to doing this, but not much more 
  than that.

I've tried all three approaches in different projects. Actually it doesn't seem to make much difference: some things are slightly easier than others depending on which choice you take, but issues surrounding this choice never really dominate a project. Making a different choice would have affected some of the processing tools, but any choice can be made to work...

One particular difference is cross-linking. If you use the first or third choice, your input XML can use XML ID/IDREF linking, as it's all one document; but if your source is separate XML documents, you need to use some syntax for external linking. So it depends a bit on how much cross-referencing between the parts you expect, and how much help you want from authoring tools for maintaining consistency of the references.
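
Ed: a sketch of the ID/IDREF case, assuming the DTD declares the target elements' id attributes as type ID (the element names here are hypothetical). A reference such as <xref ref="chap2"/> can then be resolved anywhere in the combined document with the XPath id() function:

<xsl:template match="xref">
  <xsl:value-of select="id(@ref)/head"/>
</xsl:template>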

5.

Predicates

Michael Kay




> I thought that once you used a predicate ([1]), you couldn't go 
> any further in detail.

A lot of people imagine that. I've no idea why. Applying predicates to intermediate steps in a path expression is one of the most powerful techniques in XPath.

test="preceding-sibling::*[1]/self::transition" works as follows:

1. Select all the preceding sibling elements
2. Select the first of these
3. Select this node if it is a transition element, or nothing if not
4. Convert the result to a boolean: true if the node-set selected in (3)
is non-empty

I usually write this one as

test="preceding-sibling::*[1][self::transition]"

but the result is the same.

Avoid tests like name()='transition' if you can. They are likely to be more expensive, and are less resilient to your choice of namespace prefixes in the source document. The only time to use them is when testing against a variable, e.g. name()=$param, and ideally you should then test local-name() and namespace-uri() rather than testing name().
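
Ed: a sketch of the variable case (the parameter names here are hypothetical):

<xsl:if test="local-name() = $element-name and
              namespace-uri() = $element-uri">
  ...
</xsl:if>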

6.

XSLT Modularisation experience

James Tauber

I've found it particularly useful with documents to go from content-based markup to structure-based markup with one stylesheet and structure-based markup to presentation-based markup with another.
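
Ed: a minimal sketch of such a two-stage pipeline using xsltproc (the file names are hypothetical; any XSLT processor has an equivalent invocation):

xsltproc content-to-structure.xsl article.xml > structured.xml
xsltproc structure-to-presentation.xsl structured.xml > article.html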

Mike Brown added an excellent approach

This came about answering specific questions about scalability etc., but is quite useful as a general approach to the use of XSLT. As a later correspondent put it, pick your tools wisely and remember that just because you have found a shiny new hammer, not every problem becomes a nail. On the other hand, there are an awfully large number of nails lying around waiting to be hit :-))

These questions are related and the problems are mostly resolved by changing your view of what you think of as "data" and what you think of as presentational components that belong in a stylesheet.

> Suppose you want to allow people to apply personal preferences to the
> display of a document.  Suppose there are N such preference types.

This is information you can collect and put into XML documents.

> Also you want people to view the doc in different media: over the
> web, on their fancy cell phone, or whatever.  You have M such media.  It
> seems that now you have to make N*M different stylesheets to accommodate
> these display representations of the same document.  

Components of desired output documents can also be modeled with XML and XHTML. The stylesheets can be devoted mainly to presentation *logic*, and most of the actual presentational content can be obtained from XML/XHTML. The document() function is very, very useful!

> What happens when many of the stylesheets have major similarities?  If you
> want to make a change, how do you avoid having to propagate the changes
> through each stylesheet?

Put that stuff out in XML, not in your stylesheet, and have the stylesheet go look up the information it needs based on user preferences, which is also going to be in the XML.

> What happens when you want to internationalize to L different languages?

Again, have your stylesheet just be code that goes and gets what it needs from the XML/XHTML.
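
Ed: a minimal sketch of this lookup approach, assuming a hypothetical messages.xml:

<messages>
  <msg id="greeting" lang="en">Hello</msg>
  <msg id="greeting" lang="fr">Bonjour</msg>
</messages>

The stylesheet then fetches the right string at run time:

<xsl:param name="lang" select="'en'"/>
<xsl:value-of select="document('messages.xml')
                      /messages/msg[@id = 'greeting'][@lang = $lang]"/>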

> 4.  Code reuse and refactoring.
>
> The problem with [one template with many conditionals] 
> is that the code gets nasty and unreadable very quickly.
> The problems with [many templates] are that you often replicate code

I prefer many templates. When I need to replicate code, I use calls to named templates, sometimes with parameters if there are slight variations that need to be accounted for. Named templates provide the equivalent of subroutines or private methods.

> and there is no 'otherwise' clause to catch things that fall through
> the tests.

Which tests, the match attributes on xsl:template elements? Say you want to process 'foo' elements, and you want to have one template for when foo's 'name' attribute is 'bar', one template for when it's 'baz', and one template for all other 'foo's:

<xsl:template match="foo[@name='bar']"/>
<xsl:template match="foo[@name='baz']"/>
<xsl:template match="foo"/>

And these are in addition to the built-in template that matches "*" (any element). The templates with the greater degree of specificity will have higher priority for matching.
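
Ed: in XSLT 1.0 a pattern with a predicate, such as foo[@name='bar'], has a higher default priority (0.5) than a bare element name (0), which is why the more specific templates win. If two equally specific templates ever conflict, an explicit priority attribute settles it:

<xsl:template match="foo[@name='bar']" priority="1">
  ...
</xsl:template>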

> Have people been able to move towards dynamically-generated XSL so that some
> of the contextual complexities of the above issues can be automated? 

I would guess that you are not putting enough of your data in XML. Presentational data is data, too. Put it in separate XML documents and let your stylesheet use the document() function to access it and merge it all together with your "data data".

At some point, though, you have things that need to be dynamically generated, and for this my solution for now has been to just bite the bullet and write custom templates. But I've abstracted 80% of the commonalities out of these templates and modeled as much data as I can in XML, cutting down the size and redundancy of the custom XSL templates by orders of magnitude.

Now, if you were naughty and neglected to mention that you are restricted to using the stock MSXML that comes with IE5, ignore everything I've said here. :)

7.

Keys

Michael Kay



> I'm trying to get better performance by using keys. I have heard that's
> possible, but I don't know how to do it.

If you have an expression of the form

//x[@y=$v]

then you can often improve performance by declaring a key

<xsl:key name="k" match="x" use="@y"/>

and changing the expression to

key('k', $v)
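
Ed: in context, a minimal sketch using the same names ($v stands for whatever variable or parameter holds the lookup value):

<xsl:key name="k" match="x" use="@y"/>

<xsl:template match="/">
  <!-- one indexed lookup instead of scanning every x in the document -->
  <xsl:copy-of select="key('k', $v)"/>
</xsl:template>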

8.

Standards checker for XSLT

Bryan Rasmussen et al


> We have a fair bit of XSLT and I would like to start enforcing
> standards. I was wondering if anyone knows of any standards checkers
> for XSLT/XML.

Well, we have this document http://www.oio.dk/files/OIXML_XSLT_Guidebook.pdf which basically has both links to online resources and an evaluation of those resources (it covers XSLT 1.0, is somewhat out of date, needs its evaluation of external resources updated, and there are some minor faults in some of the explanations), together with some general guidelines for the usage of particular elements - e.g. standards we would like to see in transformations expected to work in our infrastructure.

Dimitre suggests

1. Indentation. A stylesheet is an xml document, so the same formatting guidelines apply.

2. Limited line width, so that all the code can be read on screen without horizontal scrolling.

3. Consistent use of naming rules:

-- camelCase notation

-- pParameter, vVariable, etc.: Hungarian-notation prefixes that mark parameters and variables.

-- the names themselves should express the role of the variable in the transformation.

-- use of blank lines to separate sections of equally indented blocks of code that have separate meaning/roles.

-- vertical alignment of long XPath expressions

4. Preference for XPath 2.0 code over XSLT 2.0 code.
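
Ed: a hedged example of point 4, with hypothetical attribute names. A test that needs a four-element xsl:choose in XSLT can be a single XPath 2.0 conditional:

<xsl:value-of select="if (@discount) then @price * 0.9 else @price"/>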

Colin Adams suggests

To see why camelCase is so bad, compare aLongAndRatherUnreadableIdentifier with an_even_longer_but_perfectly_clear_choice_of_name.

(The example comes from p. 880 of Object-Oriented Software Construction, 2nd Edition, copyright 1997 by Professor Bertrand Meyer. I recommend every programmer, no matter what language (s)he uses, to read this book at least once.)

MK responds

You're stuck with the fact that:

* XSLT names are hyphenated

* XML Schema names are camelCase

* names in the source and target vocabulary might be anything

So for stylesheet-defined names (variables, templates, functions, etc.) you can't be consistent with everyone. It's often best to follow the conventions of the source document vocabulary.