Special Character handling, Character Maps

1. DOE avoidance
2. Recognising special characters

1.

DOE avoidance

David Carlisle




   In addition to Michael's reasons, there is also the fact that 2.0 has
   character maps that can be used for many of the things that d-o-e is
   used for in a slightly less disruptive way architecturally.

To expand on that, one of the problems with d-o-e is that you can go

<xsl:value-of select="'&lt;br&gt;'"/> 

  <xsl:value-of disable-output-escaping="yes" 
  select="'&lt;br&gt;'"/>

and the model is that that puts something into the result tree which when linearised at the end of the process produces

&lt;br&gt;<br>

The trouble is that, as specified, there isn't really anything that can go into result tree text nodes that can have that effect, so systems that support d-o-e have to have some private internal annotations to record those bits of the tree where d-o-e is in effect.

In 2.0 you can always replace this by

<xsl:value-of select="'&lt;br&gt;'"/>
<xsl:value-of 
     select="translate('&lt;br&gt;',
                      '&lt;&gt;&amp;',
                      '&#xE001;&#xE002;&#xE003;')"/>

Together with a character map that maps back:

<xsl:output-character character="&#xE001;" string="&lt;"/>
<xsl:output-character character="&#xE002;" string="&gt;"/>
<xsl:output-character character="&#xE003;" string="&amp;"/>

This works the same way, except that now the information is stored in the result tree using a mechanism that is actually allowed by the specification of the data model (just three otherwise unused characters).

Of course the other problems with d-o-e remain: this still only works if serialisation is tightly controlled by the XSLT engine, and nine times out of ten it is only being used by mistake, when the stylesheet could have gone

<br/>

and used an element node directly.

> The way I read the w3c spec, it looked like you'd have to enclose 
> those three output-character definitions in a named character map and 
> make the serializer aware of that map by specifying the name in the 
> use-character-maps attribute of xsl:output.  Did you assume that or am 
> I missing something?

That's correct; sorry I didn't make that clearer.
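As a minimal sketch of that arrangement (the map name doe and the sample template are illustrative and not from the original thread), the three definitions sit inside a named xsl:character-map, and xsl:output points at it through use-character-maps:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Tell the serializer which named character map to apply -->
  <xsl:output method="xml" use-character-maps="doe"/>

  <!-- "doe" is a hypothetical name; any QName will do -->
  <xsl:character-map name="doe">
    <xsl:output-character character="&#xE001;" string="&lt;"/>
    <xsl:output-character character="&#xE002;" string="&gt;"/>
    <xsl:output-character character="&#xE003;" string="&amp;"/>
  </xsl:character-map>

  <!-- Sample use: the private-use characters go into the result tree,
       and the map turns them back into markup at serialisation time -->
  <xsl:template match="/">
    <p>
      <xsl:value-of
          select="translate('&lt;br&gt;',
                            '&lt;&gt;&amp;',
                            '&#xE001;&#xE002;&#xE003;')"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

When the XSLT processor does the serialising, the text inside p should come out as the literal markup <br> rather than as &lt;br&gt;.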

Rather than using translate explicitly, if I were doing this I'd define a function my:d-o-e in some namespace that took a string and replaced <, > and & as I suggested, so you could just do

<xsl:value-of select="my:d-o-e(.)"/> 

where you needed to do this.
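Such a function might look like the following sketch. The namespace URI bound to my: and the parameter name are assumptions made for illustration, and the stylesheet still needs the named character map and the use-character-maps attribute discussed above for the characters to be mapped back on output.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: swaps <, > and & for three private-use
       characters that a character map can turn back into markup -->
  <xsl:function name="my:d-o-e" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:sequence
        select="translate($s,
                          '&lt;&gt;&amp;',
                          '&#xE001;&#xE002;&#xE003;')"/>
  </xsl:function>

  <!-- Example call site; the element name "raw" is made up -->
  <xsl:template match="raw">
    <xsl:value-of select="my:d-o-e(.)"/>
  </xsl:template>

</xsl:stylesheet>
```

Declaring the parameter as xs:string? lets the function accept an empty sequence, for which translate returns a zero-length string.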

> Then anywhere I use &#xE001; in my XSL I get < in my output?

yes

> Can I map « to < if I'm confident that my XML 
> won't contain «?

yes you can map any single character

> Can I map ¤ to "some really long string I'm sick of typing" or is 
> that just going too far?  (I'm really trying to give Wendell heart 
> attacks here)

yes
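A sketch of what those two mappings could look like, assuming a hypothetical map name shorthand and a made-up result element out. Keep in mind that a character map applies to every occurrence of the character in serialised text and attribute values, so this is only safe when the mapped characters never appear in the output for any other reason.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" use-character-maps="shorthand"/>

  <!-- "shorthand" is an illustrative name -->
  <xsl:character-map name="shorthand">
    <!-- U+00AB (left guillemet) is written out as a raw < -->
    <xsl:output-character character="&#xAB;" string="&lt;"/>
    <!-- U+00A4 (currency sign) is written out as a whole phrase -->
    <xsl:output-character character="&#xA4;"
        string="some really long string I'm sick of typing"/>
  </xsl:character-map>

  <xsl:template match="/">
    <out>
      <!-- A single character in the result tree; the serializer
           replaces it with the mapped string -->
      <xsl:text>&#xA4;</xsl:text>
    </out>
  </xsl:template>

</xsl:stylesheet>
```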

2.

Recognising special characters

David Carlisle

The following statement selects elements starting with special characters (it can probably be optimized, but it works):

    <xsl:template match="*[matches(substring(text()[1], 1, 1),
'[&#592;-&#99999;]')]">
        <xsl:copy>
            ...
        </xsl:copy>
    </xsl:template>

What is strange is that the matches() regex seems to allow only decimal values in the range. Here 592 is hex 0250, the start of the IPA Extensions block, which is where special characters begin for us.

The regex syntax doesn't allow any numeric references, in decimal or hex. You have to use a character. However, before being passed to XPath, the attribute is parsed by an XML parser, so you can use any XML syntax for that character: named entity references (as defined in a DTD), the character itself in the file's encoding, decimal character references (as you have), or hex character references, which would use the &#x0250; syntax.

match="*[matches(substring(text()[1], 1, 1), '[&#592;-&#65533;]')]">
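For comparison, here is the same test with hex character references, as a sketch. The template body is an assumption, since the original elides it, and the upper bound is written as &#xFFFD; (decimal 65533), the highest character in the Basic Multilingual Plane that XML accepts in a character reference.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- &#x0250; is the same character as &#592;. The XML parser resolves
       either form before XPath sees the regex, so the regex engine
       only ever receives the characters themselves. -->
  <xsl:template
      match="*[matches(substring(text()[1], 1, 1), '[&#x0250;-&#xFFFD;]')]">
    <!-- Illustrative body: copy the element and keep processing -->
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```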

It's probably simpler to just restrict the regex to the first character, rather than use substring:

match="*[matches(., '^[&#592;-&#65533;]')]">

That only selects ones up to 65533, and a lot of math characters are higher than that: there's a whole block running from 119808 (hex 1D400) to 120831 (hex 1D7FF). Script A (&Ascr;), for example, is character 119964.

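You might be better writing this as (starting the range at &#32; if the stylesheet is XML 1.0, where &#31; is not a legal character reference)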

*[not(matches(., '^[&#31;-&#591;]'))]

Andrew Welch points out

In which case

*[string-to-codepoints(.)[1] > 591]

would be nicer, wouldn't it? It might be more efficient to substring out the first character, rather than select the first from the resulting sequence...
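The substring variant could be sketched as follows. The template body is an assumption added to make the stylesheet complete, and whether this form is actually faster will depend on the processor.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Matches elements whose string value starts with a character
       above code point 591. substring() isolates the first character,
       so string-to-codepoints() returns at most one integer. -->
  <xsl:template
      match="*[string-to-codepoints(substring(., 1, 1)) > 591]">
    <!-- Illustrative body: copy the element and keep processing -->
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Everything else falls through to the built-in templates -->

</xsl:stylesheet>
```

For an element with an empty string value the function returns an empty sequence, the comparison is false, and the template does not match.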

If you really want to use the range 31 to 591 you could do this:

<xsl:variable name="range" select="31 to 591" as="xs:integer+"/>

and then

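*[not(string-to-codepoints(.)[1] = $range)]

Note that != would not do here: compared against a sequence it is true as soon as any one item differs, so for this range it would be true for every element with a non-empty string value.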