Output, and associated issues

1. xsl:output - new attributes
2. XSLT2 Character maps
3. Character maps
4. Safe-guarding codepoints-to-string() from wrong input
5. Output an ampersand... or using XSLT 2.0 properly

1.

xsl:output - new attributes

Jeni Tennison

XSLT 2.0 adds some attributes to <xsl:output>, namely:

escape-uri-attributes determines whether HTML URI attributes (e.g. a/@href and img/@src) are escaped automatically or not with the html and xhtml output methods. You can turn off escaping and then use the new escape-uri() function to get greater control over the escaping of these attributes.

include-content-type determines whether the <meta> element gets added with the html/xhtml output methods.

normalize-unicode determines whether the result gets normalized to Unicode Normalization Form C.

undeclare-namespaces determines whether namespaces should be undeclared or not -- this is to support XML Namespaces 1.1 in which namespaces can be undeclared.

Also note that the serialisation spec is now separate from XSLT: you can find it at http://www.w3.org/TR/xslt-xquery-serialization. Hopefully that means that it can be used elsewhere (in particular by XQuery).
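
As a rough sketch of how these attributes fit together, the stylesheet below turns off automatic URI escaping and escapes one attribute by hand. It assumes the names used in the final XSLT 2.0 Recommendation, where normalize-unicode became normalization-form and the escaping function is escape-html-uri(); a processor tracking an earlier draft may expect the names listed above. The link element and its target attribute are hypothetical input names.

   <xsl:stylesheet version="2.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns="http://www.w3.org/1999/xhtml">

     <!-- No automatic escaping of URI attributes; add the meta
          content-type element; normalize the result to NFC. -->
     <xsl:output method="xhtml"
                 escape-uri-attributes="no"
                 include-content-type="yes"
                 normalization-form="NFC"/>

     <!-- "link" and "@target" are made up for illustration. -->
     <xsl:template match="link">
       <a href="{escape-html-uri(@target)}">
         <xsl:value-of select="."/>
       </a>
     </xsl:template>

   </xsl:stylesheet>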

2.

XSLT2 Character maps

David Carlisle

I have produced a stylesheet to generate XSLT 2 stylesheets producing character maps corresponding to the ISO and MathML entity collections available from W3C. See the second paragraph.

This includes an xsl stylesheet corresponding to each of the ISO entity sets, one stylesheet that defines a combined character map for all of the entity sets, and a test document and sample output. There is also an archive of all the xsl files, so you can experiment with local copies.

If you are using Saxon 7.7 and want to produce output which makes use of ISO entities, you may find this useful.

Currently a mapping is defined for every entity that resolves to a single Unicode character with a code point greater than 128. Entities that resolve to more than one character, or to ASCII characters, are represented in the mapping files only by a comment.

Note that due to a feature of the current Saxon implementation you need to explicitly use method="xml" on xsl:output to use this feature (see the charmaptest.xsl file linked from the above).
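
For readers who have not seen a character map, here is a minimal hand-written sketch of the kind of mapping described. It is not taken from the generated stylesheets; the map name sample-isolat1 and the two entries are chosen only for illustration.

   <xsl:stylesheet version="2.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

     <!-- Map two characters to entity references on output. -->
     <xsl:character-map name="sample-isolat1">
       <xsl:output-character character="&#160;" string="&amp;nbsp;"/>
       <xsl:output-character character="&#233;" string="&amp;eacute;"/>
     </xsl:character-map>

     <!-- method="xml" is stated explicitly. -->
     <xsl:output method="xml" use-character-maps="sample-isolat1"/>

     <xsl:template match="/">
       <p>caf&#233;&#160;au lait</p>
     </xsl:template>

   </xsl:stylesheet>

The serializer substitutes the string for each mapped character without escaping it, so the entity references appear literally in the output. The result is only well-formed XML if a DTD declaring those entities is supplied to whatever reads it.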

3.

Character maps

David Carlisle

Experimenting with XSLT 2 character maps....

Mike Kay has implemented <xsl:character-map/> in the latest Saxon.

I have produced a stylesheet to generate XSLT 2 stylesheets producing character maps corresponding to the ISO and MathML entity collections available from W3C.

I have incorporated a distribution of these stylesheets in that site.

This includes an xsl stylesheet corresponding to each of the ISO entity sets, one stylesheet that defines a combined character map for all of the entity sets, and a test document and sample output. There is also an archive of all the xsl files, so you can experiment with local copies.

If you are using Saxon and want to produce output which makes use of ISO entities, you may find this useful.

Currently a mapping is defined for every entity that resolves to a single Unicode character with a code point greater than 128. Entities that resolve to more than one character, or to ASCII characters, are represented in the mapping files only by a comment.

Note that due to a feature of the current Saxon implementation you need to explicitly use method="xml" on xsl:output to use this feature (see the charmaptest.xsl file).
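
One way a combined character map could be assembled is with the use-character-maps attribute of xsl:character-map, which pulls other named maps into a new one. Whether the distributed stylesheets are built this way is not stated here; the map names and entries below are hypothetical.

   <xsl:stylesheet version="2.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

     <!-- Two small per-set maps (illustrative entries only). -->
     <xsl:character-map name="sample-latin">
       <xsl:output-character character="&#233;" string="&amp;eacute;"/>
     </xsl:character-map>

     <xsl:character-map name="sample-greek">
       <xsl:output-character character="&#945;" string="&amp;alpha;"/>
     </xsl:character-map>

     <!-- A combined map that simply refers to the others. -->
     <xsl:character-map name="sample-all"
         use-character-maps="sample-latin sample-greek"/>

     <xsl:output method="xml" use-character-maps="sample-all"/>

   </xsl:stylesheet>

In a real distribution the per-set maps would more likely live in separate files brought in with xsl:include or xsl:import.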

4.

Safe-guarding codepoints-to-string() from wrong input

Florent Georges



> I know that control characters are not allowed and throw
> an "Invalid XML character" error. Also, when adding very
> wide numbers (like "1234567") give a plural of the same
> error (I'm not sure why). Some characters (I believe these
> are the ones that are not assigned in Unicode) result in
> an empty string (like "12345").

> Is there a robust way of allowing/disallowing a set of
> codepoints (other than making one huge lookup list)?

Technically, it is not complex. Just define a function my:codepoints-to-string() that makes the needed checks and does what you want when encountering an invalid codepoint. I think the most difficult part is identifying which codepoints are valid. You can use the following from the XML recommendation as a starting point:

   /* any Unicode character, excluding the surrogate
      blocks, FFFE, and FFFF. */
   [2] Char ::= #x9
                | #xA
                | #xD
                | [#x20-#xD7FF]
                | [#xE000-#xFFFD]
                | [#x10000-#x10FFFF]

Document authors are encouraged to avoid "compatibility characters", as defined in section 6.8 of [Unicode] (see also D21 in section 3.6 of [Unicode3]). The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:

   [#x7F-#x84], [#x86-#x9F], [#xFDD0-#xFDDF],
   [#x1FFFE-#x1FFFF], [#x2FFFE-#x2FFFF], [#x3FFFE-#x3FFFF],
   [#x4FFFE-#x4FFFF], [#x5FFFE-#x5FFFF], [#x6FFFE-#x6FFFF],
   [#x7FFFE-#x7FFFF], [#x8FFFE-#x8FFFF], [#x9FFFE-#x9FFFF],
   [#xAFFFE-#xAFFFF], [#xBFFFE-#xBFFFF], [#xCFFFE-#xCFFFF],
   [#xDFFFE-#xDFFFF], [#xEFFFE-#xEFFFF], [#xFFFFE-#xFFFFF],
   [#x10FFFE-#x10FFFF].

Once you have identified the valid and invalid codepoints, you have to decide what to do with each. For example, call codepoints-to-string() for a valid codepoint, and return the empty sequence or the empty string for an invalid one:

   <xsl:function name="my:is-in-range" as="xs:boolean">
     <xsl:param name="value" as="xs:integer"/>
     <xsl:param name="down"  as="xs:integer"/>
     <xsl:param name="up"    as="xs:integer"/>
     <xsl:sequence select="$value ge $down and $value le $up"/>
   </xsl:function>

   <xsl:function name="my:is-valid-codepoint" as="xs:boolean">
     <xsl:param name="cp" as="xs:integer"/>
     <xsl:sequence select="
         $cp = (9, 10, 13)
           or my:is-in-range($cp,    32,   55295)
           or my:is-in-range($cp, 57344,   65533)
           or my:is-in-range($cp, 65536, 1114111)"/>
   </xsl:function>

   <xsl:function name="my:codepoint-to-string" as="xs:string?">
     <xsl:param name="cp" as="xs:integer"/>
     <xsl:if test="my:is-valid-codepoint($cp)">
       <xsl:sequence select="codepoints-to-string($cp)"/>
     </xsl:if>
   </xsl:function>

or instead the following, depending on your needs:

   <xsl:function name="my:codepoints-to-string" as="xs:string">
     <xsl:param name="cp" as="xs:integer*"/>
     <xsl:sequence select="
         codepoints-to-string($cp[my:is-valid-codepoint(.)])"/>
   </xsl:function>
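
The discouraged ranges quoted above could be tested in the same style. This is only a sketch: the function name my:is-discouraged-codepoint and the my namespace URI are made up, and the ranges are copied from the list as quoted here, so adjust them if your edition of the XML Recommendation differs.

   <xsl:stylesheet version="2.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
       xmlns:xs="http://www.w3.org/2001/XMLSchema"
       xmlns:my="http://example.org/my">

     <!-- True for #x7F-#x84, #x86-#x9F, #xFDD0-#xFDDF, and the last
          two codepoints of each plane from #x1FFFE to #x10FFFF. -->
     <xsl:function name="my:is-discouraged-codepoint" as="xs:boolean">
       <xsl:param name="cp" as="xs:integer"/>
       <xsl:sequence select="
           ($cp ge 127 and $cp le 132)
             or ($cp ge 134 and $cp le 159)
             or ($cp ge 64976 and $cp le 64991)
             or ($cp ge 131070 and $cp le 1114111
                 and $cp mod 65536 ge 65534)"/>
     </xsl:function>

   </xsl:stylesheet>

Added to a stylesheet that also defines my:is-valid-codepoint(), it can serve as a second filter, for example $cp[my:is-valid-codepoint(.)][not(my:is-discouraged-codepoint(.))].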

5.

Output an ampersand... or using XSLT 2.0 properly

Liam Quin


I am generating svg documents and would like to output an element like:

<glyph unicode="&#xE003;" horiz-adv-x="833"
 d="M124,348L124,249L709,249L709,348L124,348 Z"/>

I need to generate the value of @unicode programmatically. The
difficulty, of course, is outputting the ampersand in @unicode.


Why not just have &#xE003; in your input? Or use codepoints-to-string(57347) (57347 is the decimal version of xE003).

Make your output encoding be iso-8859-1 or even us-ascii to force the XSLT processor to generate numeric character references.
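
Put together, the two suggestions might look like the sketch below. The variable name codepoint is an assumption, as is matching on the document root; the glyph attributes are taken from the question.

   <xsl:stylesheet version="2.0"
       xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

     <!-- An ASCII output encoding cannot represent U+E003 directly,
          so the serializer must write a character reference. -->
     <xsl:output method="xml" encoding="us-ascii"/>

     <!-- The codepoint to emit, computed however you need. -->
     <xsl:variable name="codepoint" select="57347"/>

     <xsl:template match="/">
       <glyph unicode="{codepoints-to-string($codepoint)}"
              horiz-adv-x="833"
              d="M124,348L124,249L709,249L709,348L124,348 Z"/>
     </xsl:template>

   </xsl:stylesheet>

Whether the reference comes out in decimal or hexadecimal form is up to the processor; both denote the same character.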