
Disable Output Escaping

1. What does DOE do?
2. DOE History
3. Valid use of DOE
4. HTML in variables
5. Avoiding doe
6. Disable Output escaping
7. Printing out variables with escaped text.
8. Disabling output escaping
9. Disable output escaping or CDATA section?
10. Valid use for DoE
11. Can you think of a sensible use of "disable-output-escaping"?
12. Special character handling... text output

1.

What does DOE do?

Michael Kay

Think of it this way: parsing the data turns angle brackets into nodes, and &lt; into angle brackets. You can think of this as unescaping. Serializing the data turns nodes into angle brackets, and angle brackets into &lt;. You can think of this as escaping. In your transformation you want to turn &lt; into angle brackets, which means you need to unescape once more often than you escape. This is what disable-output-escaping does.

2.

DOE History

Michael Kay

XSLT 1.0 as originally published says:

"It is an error for output escaping to be disabled for a text node that is used for something other than a text node in the result tree. "

Note this concept of "the result tree" - XSLT 1.0 is based very much on the idea that there is a single result tree. Many vendors, not all, interpreted this to mean that it was an error to use d-o-e when writing a text node to a result tree fragment.

Erratum E2, however, changed this, and stated explicitly that you could use d-o-e when writing to a variable containing a result tree fragment. This was known in the WG as the "sticky d-o-e" debate: specifically, d-o-e is effectively a boolean attribute attached to each character in a result tree fragment, which stays with it when the fragment is copied.

This change caused a lot of grief when it came to defining XSLT 2.0, which abolished result tree fragments and turned them into fully-functional temporary trees. To preserve the stickiness of d-o-e, it would have been necessary to explicitly add this bit-per-character to the data model and explain how it was propagated through the wide range of operations that were now permitted on nodes in temporary trees.

In 2.0, d-o-e is deprecated, and processors are not required to implement it at all. Using d-o-e when writing to a variable is covered by this sentence:

"If output escaping is disabled for an xsl:value-of or xsl:text instruction evaluated when temporary output state is in effect, the request to disable output escaping is ignored."

(Temporary output state is in force when evaluating variables or functions and in other similar situations).

This is specifically listed as a known incompatibility: clause 20 of J.1.4 states:

"An erratum to XSLT 1.0 specified what has become known as "sticky disable-output-escaping": specifically, that it should be possible to use disable-output-escaping when writing a node to a temporary tree, and that this information would be retained for use when the same node was later copied to a final result tree and serialized. XSLT 2.0 no longer specifies this behavior (though it permits it, at the discretion of the implementation)."

(But I think the final phrase in parentheses is incorrect: it does not permit it. Section 20.2 takes priority, because it is normative, whereas Appendix J is non-normative.)

3.

Valid use of DOE

Jeni Tennison

Given:

<description>Here's a Google search for a famous phrase, &lt;a
href="http://www.google.com/search?q=to+be+or+not+to+be">to be or not to
be&lt;/a>; give it a try and see what happens. 
... </description>

1. It's UTF-8 encoded.

2. The output is to be HTML.

Jeni responded:

As usual, the easiest recourse here is to use DOE (this is what it's designed for):

<xsl:template match="description">
  <xsl:value-of select="." disable-output-escaping="yes" />
</xsl:template>

> Is this stopping a 'double' escaping of the entity?

The text within the <description> element is just text. If you have:

  <description>blah &lt;bold>blah&lt;/bold> blah</description>

then the value of the <description> element is the string:

  blah <bold>blah</bold> blah

Note that this is a *string*, not a tree. The XSLT processor can't recognise escaped markup in text automatically. The '<bold>' is just the characters '<', 'b', 'o', 'l', 'd', '>' as far as the XSLT processor is concerned, not a tag.

When an XSLT processor serialises a string as XML (or HTML) then it escapes any significant characters (e.g. < and &) using the normal escapes. So if you did:

  <xsl:value-of select="description" />

and you were generating XML or HTML then the XSLT processor would escape the < characters that it sees in the string value of the <description> element, and the output that you'd see would be:

  blah &lt;bold>blah&lt;/bold> blah

What disable-output-escaping does is to disable this output escaping -- it stops the processor from escaping the characters that are significant in XML. So you'd get:

  blah <bold>blah</bold> blah

in the output. A browser will then recognise the <bold> as a tag and behave accordingly.

Some processors have a parsing extension function (e.g. check out saxon:parse()) that would allow you to interpret the text inside the <description> element as XML, which would then allow you to do:

  <xsl:copy-of select="saxon:parse(description)" />

but note that the content has to be well-formed to do that.

In XSLT 2.0, you should use character maps rather than DOE.
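
As a rough sketch of the character-map approach (assuming an XSLT 2.0 processor; the map name raw-markup and the three private-use placeholder characters are arbitrary choices made for illustration, not something from the original thread), you translate the markup-significant characters to placeholders and let the serializer write them back out unescaped:

```xml
<xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" use-character-maps="raw-markup"/>

  <!-- Each placeholder character is written out as the given string,
       with no further escaping applied by the serializer -->
  <xsl:character-map name="raw-markup">
    <xsl:output-character character="&#xE001;" string="&lt;"/>
    <xsl:output-character character="&#xE002;" string="&gt;"/>
    <xsl:output-character character="&#xE003;" string="&amp;"/>
  </xsl:character-map>

  <xsl:template match="description">
    <!-- Swap the significant characters for the placeholders -->
    <xsl:value-of
      select="translate(., '&lt;&gt;&amp;', '&#xE001;&#xE002;&#xE003;')"/>
  </xsl:template>

</xsl:stylesheet>
```

The placeholder characters must not occur anywhere else in the data, since every occurrence is replaced on output.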

>> Or complain to the feed until they give you XML rather than this
>> hybrid.
>
> It is XML... strictly speaking? But agreed it's a mess.

What I meant was that the content of the <description> element is just text, not XML -- it doesn't have any structure that's recognisable to an XML parser, but you want it to have.

4.

HTML in variables

Michael Kay



> <xsl:variable name="italicOpen">&lt;i></xsl:variable>

I think that in general double-markup (markup disguised as text) is a bad idea, because it's very confusing and not well supported by tools. It's much better wherever possible to exploit the fact that XML is fully hierarchic, so markup can always be nested.

However, the problem does come up quite often. Sometimes it's done for bad reasons, but there are also some plausible reasons:

(a) the inner markup is HTML and is not well-formed-XML

(b) the inner markup represents a complete XML document containing a DTD internal subset

The process of turning angle brackets into nodes is called parsing. Parsing also turns an &lt; escape sequence into an angle bracket. If nested markup like this is to be manipulated by XSLT, then it needs to be turned into nodes. To get from &lt; via < to a node you need to parse it twice: you can do this using an extension such as saxon:parse().

The reverse of parsing is serialization. During serialization, nodes are turned into angle brackets, and angle brackets are turned into escape sequences (entities).

You want the &lt; in the input to become < in the output. This means you either need to parse it twice and serialize it once, or you need to parse it once and bypass the usual action of the serializer - which is what disable-output-escaping does.

Disable-output-escaping is generally derided because it's so frequently misused by beginners who haven't understood that XSLT is dealing with trees. But there are use cases for it, and a prime one is extracting HTML documents or fragments that have been wrapped in an XML wrapper. It's very problematic architecturally because it distorts the interface between the transformer and the serializer, and that's why not every processor supports it. However, there are cases where it's the best solution available.
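
For the parse-twice route, XSLT 3.0 processors offer the standard function parse-xml-fragment(), which plays much the same role as the saxon:parse() extension. A minimal sketch, assuming an XSLT 3.0 processor and well-formed inner markup (the element name wrapper is hypothetical):

```xml
<xsl:stylesheet version="3.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- 'wrapper' is a hypothetical element whose text content is escaped markup -->
  <xsl:template match="wrapper">
    <!-- Second parse: the string becomes real nodes, which are then
         serialized in the normal way, with no d-o-e involved -->
    <xsl:copy-of select="parse-xml-fragment(string(.))"/>
  </xsl:template>

</xsl:stylesheet>
```

This raises an error when the inner markup is not well-formed, which is exactly case (a) above, so it is not a substitute for d-o-e when the wrapped content is tag-soup HTML.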

5.

Avoiding doe

Wendell Piez

Mike Kay said:

> The real reasons are (a) that if you mix metaphors by generating
> part-tree and part-sequential-file, then your transformations are
> never going to be composable and reusable, and (b) that you'll never
> learn the power of the language if you start out by misusing it this way.

To reiterate and emphasize what Mike says, these two reasons for avoiding disable-output-escaping are deeply entangled with one another. Writing your markup as literals using d-o-e treats markup as if it were text, thereby breaking the basic paradigm of the structured markup approach to text processing, which is the source of its strength (as well as, occasionally, its Achilles' heel): the distinction between text (data) and markup as (despite appearances) fundamentally two different things.

It is significant that people most often try the d-o-e approach when they are either (1) still naive about the way XSLT works, and projecting mistaken expectations onto it or (2) trying to build an application whose concept or design is already threatening the markup/text distinction, such as handling ill-formed HTML markup wrapped in XML, or trying to do "metamarkup" in some way (marking up markup). (1) is curable, and (2) may sometimes be defensible, but it's often difficult to sort out where a given developer is, and you might be guilty of both.

A third case such as Rick Geimer's writing the internal declaration subset is the exception that proves the rule. Here, we'd like to write the output as markup, but since XSLT doesn't give us facilities to manage and output declarations as markup (strictly the DOCTYPE declaration with its contents ought to be markup), one has no choice but to fall back on pretending it's text.
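
A sketch of what that fallback can look like in XSLT 1.0 (the document element doc and the entity copy are made-up names for illustration; it also assumes xsl:output is not asked to emit a DOCTYPE of its own via doctype-system or doctype-public):

```xml
<xsl:template match="/">
  <!-- The internal subset is written as raw text, because XSLT 1.0
       has no instruction that creates declarations -->
  <xsl:text disable-output-escaping="yes">&lt;!DOCTYPE doc [
  &lt;!ENTITY copy "&amp;#169;">
]>
</xsl:text>
  <doc>
    <xsl:apply-templates/>
  </doc>
</xsl:template>
```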

6.

Disable Output escaping

David Carlisle


Q expansion

> In order to get  <?$tls`> content<?$tle`>
> in my output, what do I need to do in the stylesheet?
> If I use 
> <xsl:text disable-output-escaping><?$tls`></text>
> with the output method set to xml
> I get an error

This is not well-formed XML, so the XML parser you are using with the XSL system would have rejected it.

You should have used

 <xsl:text disable-output-escaping="yes">&lt;?$tls`></xsl:text>

If the output was a processing instruction, say <?tls?> content<?tle?>, then you wouldn't have to mess with disable-output-escaping at all; you could just use

<xsl:processing-instruction name="tls"/>
content
<xsl:processing-instruction name="tle"/>

7.

Printing out variables with escaped text.

Michael Kay

Saxon (by design) supports disable-output-escaping only when writing to the final result tree; it is ignored when writing to a result tree fragment.

The spec does mention the possibility of d-o-e being retained as a property of text nodes in an RTF and presumably copied when the RTF is copied, but like the rest of the spec in this area, it's very much at the implementor's option.

There are considerable problems in supporting d-o-e in a stored tree, especially once you support the node-set() extension; for example, you need to be able to merge adjacent text nodes, some of which have d-o-e and others not.
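
To make the difference concrete, here is a small sketch (the variable name open and the match pattern are hypothetical). The first use depends on the d-o-e flag surviving inside the result tree fragment; the second asks for d-o-e at the point where the text is written to the final result tree, which is the only case Saxon honours according to the description above:

```xml
<xsl:variable name="open">
  <!-- d-o-e requested while building a result tree fragment:
       a processor is free to ignore it -->
  <xsl:text disable-output-escaping="yes">&lt;i></xsl:text>
</xsl:variable>

<xsl:template match="para">
  <!-- Not portable: may come out as &lt;i> -->
  <xsl:copy-of select="$open"/>

  <!-- d-o-e applied when writing the string value to the final result tree -->
  <xsl:value-of select="$open" disable-output-escaping="yes"/>
</xsl:template>
```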

8.

Disabling output escaping

Steve Muench and others.


>What should the following produce?


<xsl:stylesheet version="1.0" 
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="/" >
    <xsl:text 
      disable-output-escaping="yes"><![CDATA[<tr>]]></xsl:text>
  </xsl:template>

</xsl:stylesheet>


With Saxon, Xalan and Oracle it produces the expected '<tr>'.

9.

Disable output escaping or CDATA section?

David Carlisle

disable-output-escaping only has an effect on the output; it does not affect the input syntax. A < must still be escaped using &lt; or CDATA.

Conversely, <![CDATA[ ... ]]> only quotes < and & in the input: it has no effect on the output. <![CDATA[ < ]]> will be output as &lt; unless d-o-e is used.

Jeni expands this:

Using a CDATA section in the stylesheet and getting a CDATA section in the result are unrelated.

A CDATA section in the stylesheet is interpreted in just the same way as a text node in which all the less-than signs and ampersands had been escaped. So you could use:

<xsl:template match="/">
  <html>
    <head>
      <script language="JavaScript">
        function twiZone(Node){
          if(Node &lt; 1){
            alert("This one.");
          }
        }
      </script>
    </head>
    <body onload="twiZone(3)">

    </body>
  </html>
</xsl:template>

and the stylesheet will appear in exactly the same way to the XSLT processor as if you used:

<xsl:template match="/">
  <html>
    <head>
      <script language="JavaScript">
        <![CDATA[
        function twiZone(Node){
          if(Node < 1){
            alert("This one.");
          }
        }
        ]]>
      </script>
    </head>
    <body onload="twiZone(3)">

    </body>
  </html>
</xsl:template>

The same goes for CDATA sections in the source document -- they're simply not part of XSLT's data model.

Using the cdata-section-elements attribute on xsl:output, on the other hand, affects how text nodes are output -- whether the less-than signs and ampersands that they contain are escaped individually or whether the entire text node is wrapped in a CDATA section. In either of the above, if the script element's text nodes should be wrapped in CDATA sections, you'll get:

      <script language="JavaScript">
        <![CDATA[
        function twiZone(Node){
          if(Node < 1){
            alert("This one.");
          }
        }
        ]]>
      </script>

otherwise you'll get:

      <script language="JavaScript">
        function twiZone(Node){
          if(Node &lt; 1){
            alert("This one.");
          }
        }
      </script>

The HTML output method doesn't wrap the script (or style) element's content in a CDATA section, and doesn't escape the less-than signs or ampersands; this is to ensure compatibility with legacy HTML processors.
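
For reference, the declaration that asks for the CDATA-wrapped form looks something like this (a sketch; it assumes the xml output method and a script element in no namespace):

```xml
<!-- Top-level declaration: text node children of <script> elements
     are serialized as CDATA sections instead of being escaped -->
<xsl:output method="xml" cdata-section-elements="script"/>
```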

10.

Valid use for DoE

Michael Kay




> I have a xml file which has some html embedded inside.
> The file is like this:

> <?xml version="1.0" encoding="UTF-8"?>
> <?xml-stylesheet type="text/xsl" href="bcel-fb3.xsl"?>
> <BugCollection>
>
> <BugInstance type="URF_UNREAD_FIELD" priority="2"> ..........
> ..........
> </BugInstance>
>
> <SummaryHTML><![CDATA[<html>
> <body>
> <h1><center>Findbugs Summary Report</center></h1>
> ..........
> ..........
> </body>
> </html>
> ]]></SummaryHTML>
> </BugCollection>


> I want to extract the block of HTML code and make a separate HTML file 
> and create some links of my own inside the new HTML links.

> But I do not want the format of the HTML file to be changed 

This is a legitimate use case for disable-output-escaping:

<xsl:value-of select="SummaryHTML" disable-output-escaping="yes"/>

But all the usual caveats about d-o-e apply: it is not supported by all processors in all environments.

11.

Can you think of a sensible use of "disable-output-escaping"?

David Carlisle

4 off the top of my head:

a) creating almost-but-not-quite XML like some template languages

<% .... %>

(ASP, JSP etc.)

As that isn't legal XML, XSLT won't generate it without a bit of help from d-o-e.

b) Generating a local subset of a doctype, and entity references, if you _really_ have to.

c) In the MathML specification we use CDATA sections for mathml examples. We don't just "inline" the XML which is the usual advice as we want tight control over things like indentation and use of ' or " around attribute values. If you are telling the user that they can do a="2" or a='2' you don't want the system to write them both out the same way:-) In the normative html version of the spec it's no problem you just value-of the example into a <pre> and it all works, but in the XHTML+MathML version we _also_ want to inline it as XML so you get a side-by-side view of the literal XML and how your browser renders it, we use d-o-e to produce this.

d) If you have quoted html inside your xml, as in <foo><![CDATA[ a <br> c <img src= "x.png"> jjj ]]></foo>, and you need that html in the output. If you can fix your input not to do that, good, but often you can't (RSS feeds etc.), and so d-o-e can be used as a method of last resort (although xslt2 offers alternatives, more on that another day perhaps :-))
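
Case (a) might look roughly like this in a stylesheet that generates a JSP page (a sketch only; the match pattern and the JSP variable pageTitle are hypothetical):

```xml
<xsl:template match="title">
  <h1>
    <!-- Writes <%= pageTitle %> literally into the output;
         without d-o-e the serializer would emit &lt;%= pageTitle %> -->
    <xsl:text disable-output-escaping="yes">&lt;%= pageTitle %></xsl:text>
  </h1>
</xsl:template>
```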

12.

Special character handling... text output

David Carlisle


<xsl:output method="text" encoding="windows-1250"/>
            ^^^^^^^^^^^^^

In text output there are no markup characters, so there are no special characters that need to be escaped (and no way of escaping any character)

<xsl:value-of select="$test" disable-output-escaping="yes"/>

so disable output escaping has no effect. Of course it is (normally) bad style to use it even for xml output.

  Xalan -p test "'test &#225; test'" data.xml stylesheet.xsl produces
   "test &#225; test" 

As expected. The input parameter is a string, so it is not parsed as XML, so & # 2 2 5 ; is not a reference to an a-acute but rather the 6 characters ampersand hash 2 2 5 semicolon.

On output those characters are just output as themselves in text mode, there is no escaping being done.

If you need to input in XML syntax, place the string in an XML file and process it using the document() function.
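
A minimal sketch of that approach, assuming a hypothetical file strings.xml stored alongside the stylesheet (the file name and element names are invented for the example):

```xml
<!-- Assumes strings.xml contains:
       <strings><test>test &#225; test</test></strings>
     The XML parser resolves the character reference when the file is read,
     so $test holds a real a-acute rather than six separate characters -->
<xsl:variable name="test"
      select="string(document('strings.xml')/strings/test)"/>

<xsl:template match="/">
  <xsl:value-of select="$test"/>
</xsl:template>
```

With a single string argument, document() resolves the relative URI against the base URI of the stylesheet, so the file is looked up next to the stylesheet rather than next to the source document.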