xslt special characters, escaping

Special Characters

1. Character set primer
2. Understanding special characters
3. Writing a tab character
4. How to insert an angle bracket in the result
5. How to generate a carriage-return end-of-line in my XSL Stylesheet
6. How to use the disable-output-escaping attribute
7. How to output element containing HTML
8. Angle brackets in JavaScript
9. Pattern matching in text with CRLF
10. Need XML output with DOCTYPE ...
11. Writing out special characters
12. Using CDATA but still less than is inserted
13. Less Than and Greater than comparison in xsl:if
14. XML output with DOCTYPE
15. Non-Latin characters in XT output
16. How to get an apostrophe into a test expression
17. How do I use non-literals in xsl:if statements
18. Character escaping
19. Escaping a character
20. How do I drop an apostrophe
21. How to strip white-space
22. Javascript in html output method
23. How to get ampersand
24. percent signs and html output method
25. How to strip ampersands from xml code using xsl
26. How to transform each character of a string
27. How to output a newline character
28. Bulgarian or Cyrillic characters in my xsl
29. Copyright character
30. Multiple special characters in XML
31. National characters
32. Certain chars break transformation process.
33. Converting quotes
34. Special characters
35. Using Unicode entities in XSLT
36. International Characters and browsers
37. braces in result tree
38. CJK UTF-16 test
39. Ampersand for URLs
40. Fallback for unsupported entities
41. Unicode fonts
42. Special characters
43. Encoding problems, win32 platforms
44. Managing the &quote; entity
45. Ampersand in uri content
46. AVT escaping
47. Getting the Base Character from a Character with Diacritic

1.

Character set primer

Chris Maden

Time for the primer again.

A character is an abstract notion, like "Latin capital letter A".

A character repertoire is a collection of characters - like "Latin upper-case letters". Different languages require different character repertoires.

A character set is an ordered, numbered character repertoire. ISO 8859-1 is one such character set, assigning numbers 0-255 to 256 characters. Its repertoire covers nearly all of the characters needed for western European languages like French, Spanish, German, and Italian, as well as English, Icelandic, Swedish, Norwegian, and Dutch. There are other ISO 8859 character sets that cover characters needed by other languages like Turkish, Polish, Greek, Russian, Hebrew, and Arabic.

Unicode is also a character set. It assigns the numbers 0 - (2^32)-1 to a whole lot of characters. Its repertoire includes all of the characters covered in other national and International Standards, including all of the ISO 8859 sets.

An encoding is a mapping of bit patterns to a character set. UTF-8 and UTF-16 are encodings of Unicode. In a sense, ISO 8859-1 and its kin are also encodings of Unicode, but ones that can not represent all of the characters.

In short: Unless you are working in Klingon, Minbari, or Silvestri, Unicode covers the characters you need in its repertoire. UTF-8 and UTF-16 are both capable of representing all of the characters in Unicode. All XML parsers are required to read UTF-8 and UTF-16 data.

2.

Understanding special characters

Mike Brown


> An element in the XML may have '&' chars in it and I
> can't seem to find the way to get it properly escaped.  For example the
> actual data may look like:
> 
> MARK & SCAN
> 
> and the XML for the ID element that I send to the browser looks like:
> 
> <ID>MARK&#x20;&#x26;&#x20;SCAN</ID>
	

...when this XML is parsed, it becomes the character data MARK & SCAN. That's all the XSLT processor will get to see. Your character references are just to disambiguate the character data from the markup, for the parser's benefit. You weren't actually making the character data be MARK&#x20;&#x26;&#x20;SCAN.

> In the XSL code, I simply try to build the URI as follows (not everything is
> here...)
> 
> <a>
>   <xsl:attribute name="href">
>     <xsl:value-of select="$Server"/>?ID=<xsl:value-of select="ID"/>&amp;APP=
> etc...
>   </xsl:attribute>                  
>   <xsl:value-of select="ID"/>
> </a> 
Probably easier to do something like
 <xsl:value-of select="concat($Server,'?ID=',ID,'&amp;APP=',etc)"/>
or
 <a href="{$Server}?ID={ID}&amp;APP={etc}">click here</a>

> I have <xsl:output method="html"/> set, and when I copy the link into the
> address field on the browser, I see that the spaces have converted, but not
> the & char.  The URI will look like:
> 
> http://server?ID=A%20&%20B&APP= etc... with the &
> still explicitly there hosing my URI.
> I'm sure I'm missing something simple.

You are getting exactly what you asked for, but...

- what you see in the status bar when you mouse over the link in IE is the URI derived from what text & markup is actually in the HTML. i.e., &amp; in the document will become & in the URI.

if you want to see the actual HTML that was generated, use the 'view XSL output' context menu addition that you can download from MS. There's a link to it in the MSXML FAQ.

my testing indicates that the href comes out exactly as expected: <a href="http://www.skew.org/printenv.cgi?ID=MARK &amp; SCAN&amp;APP=foo">

- what you asked for is not what you wanted, because to be a proper URI the '&' in the first parameter must be '%26' and the spaces must be either '%20' or '+'. you didn't ask for these, so that's why you didn't get them.

Your real problem is that URL encoding is not a standard feature of XSLT/XPath. On Nov 17 I posted a way to use Java's built-in URL encoder via an extension function. That may not do you any good if you're using MSXML. Anyone have any tips for doing URL encoding with MSXML?

If you are reasonably sure about your input data, you could probably get away with just escaping a few characters that you expect to see, by using translate() to convert spaces to plus signs, and/or something like the example I provide for the rest.
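
A minimal sketch of the translate() approach, assuming the only troublesome characters in ID are spaces (the APP value foo is made up for illustration). It does not deal with '&' itself, which needs a one-to-many replacement that translate() cannot do:

 <a href="{$Server}?ID={translate(ID, ' ', '+')}&amp;APP=foo">
   <xsl:value-of select="ID"/>
 </a>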

3.

Writing a tab character

Ron Ten-Hove


>How do I say I want to write a tab(\t) in XSL?


Use &#9;

<xsl:text>&#9;Indented text</xsl:text>

4.

How to insert an angle bracket in the result

David Carlisle

You can not have an XML document that has a < that is not written as an entity reference or inside a CDATA section, so this isn't really an XSLT restriction.

By default XSL writes XML documents (although some XSL processors offer options to serialise the result tree in non-XML syntax).

From the WD. Normally, the xml output method escapes & and < (and possibly other characters) when outputting text nodes. This ensures that the output is well-formed XML. However, it is sometimes convenient to be able to produce output that is almost, but not quite well-formed XML; for example, the output may include ill-formed sections which are intended to be transformed into well-formed XML by a subsequent non-XML aware process. For this reason, XSLT provides a mechanism for disabling output escaping. An xsl:value-of or xsl:text element may have a disable-output-escaping attribute; the allowed values are yes or no; the default is no; if the value is yes, then a text node generated by instantiating the xsl:value-of or xsl:text element should be output without any escaping. For example,

<xsl:text disable-output-escaping="yes">&lt;</xsl:text>

       should generate the single character <.
            

5.

How to generate a carriage-return end-of-line in my XSL Stylesheet

Mike Brown

The problem is that the parser reports XML parsing error for a character less than 0x20. Is this correct?

The specs said #xD or #xA would work, but they didn't.

&#xA; is the same as &#10;, which is ctrl-J or LF (linefeed; usually sufficient).

&#xD; is the same as &#13;, which is ctrl-M or CR (carriage return).

When you say &#xA; (which is what I assume you meant, not #xA) doesn't work, I assume you mean to say that you're not seeing it in some certain output ... what output are you looking at, exactly? Can you give us an example? What XSL processor are you using? What result namespace are you using?

The reason I ask is because the linefeed character probably *is* getting into the result tree, but maybe you're not seeing it in the output document derived from that tree. We also had someone saying something similar recently, and it turned out they were looking at a browser's rendering instead of the HTML which had been derived from the result tree.

6.

How to use the disable-output-escaping attribute

Chuck White

I doubt if anyone else cares, but the disable-output-escaping attribute is a savior for producing newspaper classified markup (a lot of newspaper typesetting systems use < and >).

<xsl:text disable-output-escaping="yes">&lt;</xsl:text>

To which JC added: The disable-output-escaping attribute is best for those cases where you are producing something that is almost HTML or XML but not quite (eg XML with some ASP or JSP directives). If you are producing something that is not at all like XML, it's easier just to add a top-level declaration:

<xsl:output method="text"/>

7.

How to output element containing HTML

Mike Brown

Q: Expansion I have an element that may or may not be well formed XML (user entered HTML).

I want to insert it directly without any special quoting into a stylesheet output.

There simply is no getting around the fact that your XSL stylesheet must produce a well-formed result tree that could be represented as another XML document. From this result tree some kind of actual output such as HTML 4.0 will be derived. It's not a fundamental problem that XSL won't let you force part of the result tree to not be well-formed.

Your goal should be to create a well-formed result tree that somehow contains the crappy user data. You'll have to treat the user data strictly as CDATA and put it in a text node, and then you'll want to tell the XSL processor to handle the HTML output of that part of the tree specially.

As of the latest XSLT working draft (Aug 99), there is the disable-output-escaping functionality for text nodes, which will allow that user data to be output with literal < > angle brackets instead of &lt; and &gt; entities. You may find that this isn't always as useful as you'd think, but you *should* be able to use

<xsl:value-of select="MESSAGE_BODY" disable-output-escaping="yes"/>

with your

<MESSAGE_BODY><![CDATA[ <A HREF="http://domain.com">http://domain.com</A>
]]></MESSAGE_BODY>

to create a special text node in the result tree. The contents of the text node will be the CDATA and the presence of the disable-output-escaping="yes" will cause a conforming XSL processor to flag that node as not requiring special characters to be entified.

8.

Angle brackets in JavaScript

David Carlisle

Q: Expansion

I want to output JavaScript with a lot of "<",">" and "&" characters in it. Is there a way I can use these characters in a large block without having to write the equivalent entities every time?

Initial answer: Source unknown. Please note the caveat beneath.

Wrap the characters in <![CDATA[...]]>, like this:

   <xsl:text disable-output-escaping="yes">
   <![CDATA[

   <SCRIPT LANGUAGE="JavaScript"><!-- 
   ... foo & bar ...
   --></SCRIPT>

   ]]>
   </xsl:text>

Beware though, that if you do this you are not outputting a SCRIPT element but the characters < S C R I P T thus this stylesheet only works in a context in which you are linearising the result tree as an XML file, and then re-parsing it as XML (or HTML in this case). It wouldn't (or shouldn't) work in an embedded context where the result tree of the stylesheet is being directly accessed, eg by the layout engine of a browser.

Tony Graham Adds:

A better sequence would be:

<SCRIPT>
 <xsl:text><![CDATA[...]]></xsl:text>
</SCRIPT>

The August working draft says that "The html output method should not perform escaping for the content of the script and style elements." With an up-to-date XSLT processor, you wouldn't need the xsl:text, and

<SCRIPT
LANGUAGE="JavaScript"><![CDATA[...]]></SCRIPT>

would be fine.

9.

Pattern matching in text with CRLF

Michael Kay


How do I match element content containing CRLF?

Try:

<xsl:when 
test='normalize-space($myOpinionValue)="excellent"'>
	    5
</xsl:when>

            

10.

Need XML output with DOCTYPE ...

David White


<xsl:template match="/">
<xsl:processing-instruction
name='xml'>
	version="1.0"</xsl:processing-instruction>
<xsl:text disable-output-escaping="yes"><![CDATA[   
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">]]>
</xsl:text><wml>
<xsl:apply-templates/>
</wml>

</xsl:template>

works, in that it produces:

<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
[stuff]
</wml>

Horrible, isn't it?

I hope the next version of XT supports xsl:output.

11.

Writing out special characters

Ken Holman

Q expansion I'd like to output " <body bgcolor="White">" as part of my HTML output

To get the result you desire, you must add a <body> element node with a bgcolor="White" attribute node to the result tree and not try to effect the result using a string.

Using a result tree template, this would be as easy as:

   <xsl:template match="/">
     <body bgcolor="white">
       <xsl:apply-templates/>
     </body>
   </xsl:template>

Using result tree instructions, this would be:

   <xsl:template match="/">
     <xsl:element name="body">
       <xsl:attribute 
	name="bgcolor">white</xsl:attribute>
       <xsl:apply-templates/>
     </xsl:element>
   </xsl:template>

p.s. there *is* a technique to actually emit the "<" character from a string, but that technique is not portable across all XSLT engines because an engine is allowed to not implement the technique ... plus, what you want to do can be done without resorting to the technique.

12.

Using CDATA but still less than is inserted

Scott Boag

Q: Expansion
I can't figure out why I get the following:

My XML:
<text>
<![CDATA[1234 <A href="http://1234">http://1234</A> 1234]]>
</text>

My XSL:
<xsl:value-of select='text'/>

Produces this:

1234 &lt;A href="http://1234"&gt;http://1234&lt;/A&gt; 1234

I want it to produce this:

1234 <A href="http://1234">http://1234</A> 1234

Try:

<xsl:value-of select='text' disable-output-escaping="yes"/>

Mike Kay adds: This is probably the most-FAQ! Whether your input uses CDATA or entity references or character references is irrelevant, a "<" supplied in character content cannot be used to generate markup in the output document. If you want to generate an <A> element, use a literal result element <A> in the stylesheet, or <xsl:element>.

13.

Less Than and Greater than comparison in xsl:if

Dan Machak

Q: Expansion: I am attempting to perform an inequality comparison with <xsl:if>.

The obvious problem is that a ">" or a "<" in the @test attribute is an illegal character.

Use &gt; and &lt;
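
For example, a sketch assuming a hypothetical count attribute on the current node:

<xsl:if test="@count &lt; 10">
  <xsl:text>fewer than ten</xsl:text>
</xsl:if>

XML does allow a literal > inside an attribute value, so test="@count > 10" also parses; writing &gt; just keeps the two comparisons symmetrical.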

14.

XML output with DOCTYPE

Dylan Walsh

<xsl:template match="/">
 <xsl:text disable-output-escaping="yes">&lt;?xml version="1.0"?&gt;</xsl:text>
    <xsl:text disable-output-escaping="yes">
	&lt;!DOCTYPE wml PUBLIC
	"-//WAPFORUM//DTD WML 1.1//EN"
	"http://www.wapforum.org/DTD/wml_1.1.xml"&gt;
    </xsl:text>
</xsl:template>

works, in that it produces:

<?xml version="1.0"?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN"
"http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
[stuff]
</wml>

15.

Non-Latin characters in XT output

Michael Kay

Q expansion I have some non-Latin characters in my xml documents as character references and I'd like to run the documents through xt and those character references would still be there. example:

Source:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <character>
    &#x0069; &#x0047; &#x0049; &#x0107;
  </character>

Stylesheet:

  <?xml version="1.0" encoding="ISO-8859-1"?>
  <xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html" indent="yes"/>

    <xsl:template match="/">
      <xsl:value-of select="character"/>
    </xsl:template>

  </xsl:stylesheet>

when I run it through the latest xt, I get:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 
	Transitional//EN">
  i G I &#263;

but when I try to output it as xml by changing the method to "xml", I get:

  i G I &Auml; ?

and I'd like to get in my xml output:

  i G I &#x0107;

Answer:

The XSLT syntax to achieve this is <xsl:output encoding="iso-8859-1"/>. You'll have to check whether xt supports it. (SAXON 4.7 does, provided that the Java runtime does.)

David Carlisle adds:

Of course you _shouldn't_ want that, as the version with utf-8 encoded output is completely equivalent to an XML application.

However, assuming that you will want that anyway, I think that the way to get this in XSL would be to change the output encoding from utf-8 to anything else which does not directly encode position &#x0107; then this slot will have to be output as that (or the decimal equivalent)

so in other words you want

<xsl:output
  method="xml"
  encoding="iso-8859-1"/>

However xt (the new release to match the new PR) says:

The xml output method ignores the encoding, doctype-system, doctype-public, so currently I don't think you can do this in xt (without a lot of pain)

David later explained,

The xml/unicode character set consists of the numbered characters in the range 1 through to hex 10FFFF (with some slots disallowed, but ignore that for now).

That is the `Universal Character Set (UCS)'

utf8 is a particular encoding of that range (actually it can encode the full UCS4 range, up to hex 7FFFFFFF, although `only' the first 17 planes of 2^16 characters are currently in Unicode, and only the first 2^16 characters up to FFFF are in Unicode 2.x).

Note that utf8 is just an `encoding' of the 32bit character number into 1 or more sequences of 8bit bytes, it does not re-order or subset the available characters.

Now `traditional' encodings like `latin1' or `latin2' or `windows ansi' or `microsoft code page 850' or the 8bit cyrillic encodings are subsets of the available characters in UCS (if they are not subsets they can not be used in XML as the underlying character set in XML is always unicode).

If you say

<?xml version="1.0" encoding="microsoft-weirdness" ?>

then the available characters and the way they are encoded as bytes (ie effectively their order) is whatever Bill Gates says it is. So the byte with value 255 may or may not be y-umlaut (which is what position 255 is in latin1 and unicode) However the syntax &#255; (and equivalently &#xFF;) _always_ refers to the unicode numbering not the current encoding used to decode bytes of character data.

So....

If the encoding is the default utf8 encoding and an XML system wants to output the character hex 107 (which is c-acute) then it can _always_ output it as either

&#x107; or &#263;

however since that is 6 or 7 bytes, if the xml declaration specifies an encoding for character data that includes this slot then probably the system will just do that. This is a latin-2 character so if the encoding is specified as latin-2 then c acute can be encoded in the single byte with value 230. If the encoding is utf8 then there will be a two byte representation of character position 263, as shown in the original poster's question.

Since the request in this case was to force the system to use the character reference form, the actual encoding for the character data did not matter, as long as this character was _not_ part of the encoding.

If you pick latin-1 (or ascii, or presumably a cyrillic encoding) then in that encoding there is no encoding for c-acute, ie no encoding for unicode #x107, so with any of these encodings the only way to get a c acute is to use &#x107; (actually you could use c followed by a combining acute character, but whether or not that is the same thing depends on who you are, and what you are doing...)

16.

How to get an apostrophe into a test expression

Michael Kay

Q expansion. How can I refer to a quote inside a quoted string inside another quoted string?

You have to remember that the entity reference gets expanded before the XPath parser sees it. So you will need to use the opposite kind of quote from the one XPath is using. Try

<xsl:when test='substring($Nom,1,2) = "l&apos;"'>

Pete Johnston adds:

There may be neater solutions, but I think setting up a variable and then using the variable name in your substring function call works with XT:

e.g.

either

<xsl:variable name="lapos">l&apos;</xsl:variable>
or
<xsl:variable name="lapos">l&#39;</xsl:variable>

and then

<xsl:when test="substring($Nom,1,2) = $lapos">

17.

How do I use non-literals in xsl:if statements

David Carlisle

You are probably looking at the value of the variable outside the scope of where it is bound.

A common mistake is

<xsl:if test="something">
  <xsl:variable name="x" select="something_else"/>
</xsl:if>
    ...$x ...
which doesn't work as the scope of the binding of $x ends at the
element containing the binding, ie the </xsl:if> here.

Given: 
<file previous_date="03/13/99">
  <file_date>04/03/99</file_date>
</file>
      
            

If you are on the file node then the string value of . will include the white space, in which case you might want to use normalize-space(.) but safer would be normalize-space(file_date)= @previous_date

If you are on the file_date node, then you want normalize-space(.)=normalize-space(../@previous_date). The normalize-space() can be omitted if you know there is never any white space in these strings.

18.

Character escaping

David Carlisle


>Here's my Python code: 
>j="Ain't she pretty."
>Now I have a problem when I put this in an XML attribute value
><button onclick="j=&quot;Ain't she pretty&quot;"/>


<xsl:if test=
  "@onclick=concat('j=&quot;Ain',&quot;'t she pretty&quot;,'&quot;')">

the other way of getting " and ' into the same string, without using concat() is <xsl:variable name="foo">j="Ain't she pretty."</xsl:variable>

19.

Escaping a character

Jeni Tennison

Reported by Kathy Burke

Here is a template from Jeni Tennison - works great for escaping all sorts of things. Just call it from wherever you need it.

<xsl:template name="escape-apos">
 <xsl:param name="string"/>
<xsl:variable name="apos" select='"&apos;"' />
<xsl:choose>
 <xsl:when test='contains($string, $apos)'>
  <xsl:value-of select="substring-before($string,$apos)" />
	<xsl:text>\'</xsl:text>
	<xsl:call-template name="escape-apos">
	 <xsl:with-param name="string"
          select="substring-after($string, $apos)" />
	</xsl:call-template>
 </xsl:when>
 <xsl:otherwise>
  <xsl:value-of select="$string" />
 </xsl:otherwise>
</xsl:choose>
</xsl:template>
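
A possible call, assuming a hypothetical title element whose text is to be placed inside a single-quoted JavaScript string (the variable name t is made up for illustration):

<xsl:text>var t = '</xsl:text>
<xsl:call-template name="escape-apos">
  <xsl:with-param name="string" select="title"/>
</xsl:call-template>
<xsl:text>';</xsl:text>

The template recurses on whatever follows each apostrophe, so every apostrophe in the string gets a backslash in front of it, not just the first.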

20.

How do I drop an apostrophe

David Carlisle

Given an XML document like this:

   <foo>
     <bar name="Jim's Code">abc</bar>
   </foo>

How do I get a desired output XML document like this:

   <jimscode>abc</jimscode>

Specifically, I'm looking for how to drop apostrophes. I have figured out how to translate (and drop) characters in general using code like this:

   <xsl:variable name="upperCaseChars"
        select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
   <xsl:variable name="lowerCaseChars"
        select="'abcdefghijklmnopqrstuvwxyz'"/>
   <xsl:variable name="uCaseCharsPlus"
        select="concat($upperCaseChars, ' /-(),+')"/>
   <xsl:variable name="lCaseCharsPlus"
        select="concat($lowerCaseChars, '')"/>

   <xsl:variable name="element_name"
        select="translate(@name, $uCaseCharsPlus, $lCaseCharsPlus)"/>
   <xsl:element name="{$element_name}">
     ...
   </xsl:element>

But I can't get it to work with apostrophes. I have tried escaping it using &#x27;, &#39;, and &apos;, but no luck.

Answer:

To get an apostrophe into your string, if you are using select= syntax you have to use " to surround the XSLT string. Then if you are using " for the XML attribute value, you have to use &quot; for the XSLT ", resulting in

<xsl:variable name="uCaseCharsPlus"
select="concat($upperCaseChars, &quot; /-(),+'&quot;)"/>

21.

How to strip white-space

David Carlisle


Q expansion:
>I tried
><xsl:element name="A">
>        <xsl:attribute name="href"><xsl:value-of
>select="myValue"/></xsl:attribute>
>        <xsl:text>This is a link.</xsl:text>
></xsl:element>

Don't do that, do this

<xsl:element name="A">
        <xsl:attribute name="href"><xsl:value-of
select="normalize-space(myValue)"/></xsl:attribute>
        <xsl:text>This is a link.</xsl:text>
</xsl:element>

or simpler, this

<A href="{normalize-space(myValue)}">
  <xsl:text>This is a link.</xsl:text>
</A>

22.

Javascript in html output method

Michael Kay

> 
> If the desired output document should contain:
> 
> <h1> <%= request.getParameter("title") %> </h1>
> 
>In the stylesheet I can use <xsl:text> with 
>disable-output-escaping as shown below.  
>Obviously, this is very verbose and I'd like to avoid it.

You could try generating it in the stylesheet as a processing instruction

<xsl:processing-instruction name="jsp">=
request.getParameter("title")</xsl:processing-instruction>

And then put in an output handler to produce the <% %> syntax on output: the exact details will be product dependent but most XSL processors allow you to pass the output to a SAX DocumentHandler.

23.

How to get ampersand

David Carlisle

Q expansion I want to create something like the following with my XSL stylesheet:

<a href="ctdmr.exe?ct=id_1_1_1&vw=NOTct#top" target="textContent" ...>

but I cannot figure out how to get the ampersand in the string. If I just use an ampersand, I get an XslProcessorException because it thinks it is an entity reference, and if I use &amp; or &#38; I get &amp; in the output instead of just the ampersand.

Is there a way to get a single ampersand in the HREF attribute value? ( I try <xsl:text disable-output-escaping="yes"> and <xsl:output method="HTML"> and I have the same results. )

Answer:

> I get &amp; in the output instead of just the ampersand.

That's what you want. Your HTML or XML application will (or should) expand that before passing it to the cgi-bin application.
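
In the stylesheet that means writing the ampersand as &amp; in the literal result element. A sketch using the href from the question (the apply-templates content is only a placeholder):

<a href="ctdmr.exe?ct=id_1_1_1&amp;vw=NOTct#top" target="textContent">
  <xsl:apply-templates/>
</a>

The serialised output again contains &amp;, and the browser turns it back into a single & when it parses the attribute value.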

24.

percent signs and html output method

Mike Kay

> 
> If the desired output document should contain:
> 
> <h1> <%= 
	request.getParameter("title") %> </h1>
> 
> 1- In the stylesheet I can use <xsl:text> with 
> disable-output-escaping as
> shown below.  Obviously, this is very verbose and I'd like to 
> avoid it.
> 

You could try generating it in the stylesheet as a processing instruction

<xsl:processing-instruction name="jsp">=
request.getParameter("title")</xsl:processing-instruction>

And then put in an output handler to produce the <% %> syntax on output: the exact details will be product dependent but most XSL processors allow you to pass the output to a SAX DocumentHandler.

25.

How to strip ampersands from xml code using xsl

David Carlisle

It depends what you mean. If you mean ampersand characters in the character data of a text node, then you just need to use

select="translate(.,'&amp;','')"

where . in the first argument means the text of the current node, could be any xpath expression.

If you mean stripping off ampersands in the concrete markup in an xml file then basically you can't as & is the entity reference start character and the XML parser will expand the entities before passing the input to the XSL system.

So if you have
<test>
 this &amp; that
</test>

the first will give you " this that". If you want "this amp; that", then in this case you could do it, as you just need to replace all ampersands by `amp;'. But in general you can't, as there is no record that there was an entity reference in the input tree, so except for entities pointing at single characters you are stuck.

26.

How to transform each character of a string

David Carlisle


Q expansion:
The numbers (1 to 0) have a code corresponding to letters with
accents.
Examples :
1 is &acirc;
2 is &ecirc;
3 is &icirc;
and so on.

So when I have for example
<mn>132</mn>
in the source, I'd like to have
&acirc;&icirc;&ecirc;
in the HTML output.

If acirc is what HTML means by it (a with circumflex) then that is a single unicode letter, &#xE2;, so that one could be done with translate(.,'1','&#xE2;'). If any of the replacements are not single unicode characters you can't use translate, which only works on a one to one basis, but you could use this:

<!-- replace all occurences of the character(s) `from'
     by the string `to' in the string `string'.-->
<xsl:template name="string-replace" >
  <xsl:param name="string"/>
  <xsl:param name="from"/>
  <xsl:param name="to"/>
  <xsl:choose>
    <xsl:when test="contains($string,$from)">
      <xsl:value-of 
	select="substring-before($string,$from)"/>
      <xsl:value-of select="$to"/>
      <xsl:call-template name="string-replace">
      <xsl:with-param name="string" 
	select="substring-after($string,$from)"/>
      <xsl:with-param name="from" select="$from"/>
      <xsl:with-param name="to" select="$to"/>
      </xsl:call-template>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="$string"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
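
The template might be called like this for the digit 2 (a sketch, assuming the current node is the parent of mn and that &#xEA;, e with circumflex, is the replacement wanted):

<xsl:call-template name="string-replace">
  <xsl:with-param name="string" select="mn"/>
  <xsl:with-param name="from" select="'2'"/>
  <xsl:with-param name="to" select="'&#xEA;'"/>
</xsl:call-template>

Each call replaces one `from' string, so handling several digits means nesting calls or capturing each result in a variable before the next call.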

Sebastian Rahtz adds, for a text output ensure that the output encoding is set appropriately for your use, e.g. <xsl:output method="text" encoding="ISO-8859-1"/>

Note however that xt only supports specifying output encoding in the text method. If you are outputting XML it will ignore this. (Saxon supports this though)

27.

How to output a newline character

Ken Holman


I just use:

   <xsl:text>&#xa;</xsl:text>

or if there are a lot of them in the file:

<!DOCTYPE xsl:stylesheet [
<!ENTITY nl "&#xa;"><!--new line-->
]>
...
    <xsl:text>&nl;</xsl:text>

or, for MSDOS text output I just redefine the entity:
<!ENTITY nl "&#xd;&#xa;">

28.

Bulgarian or Cyrillic characters in my xsl

Nikolai Grigoriev

It depends mostly on whether the appropriate fonts are installed on your machine. IE4-5 under Win95/NT works fine with UTF-8 if you have the appropriate charset (204) in your fonts; normally, Times New Roman, Arial, and Courier New contain the charset 204, and all other do not. In the same environment, you may also try windows-1251 as the charset name.

If you are under Unix, try either koi8-r (sometimes spelled as koi8r, without dash) or iso-8859-5; chances are that you have at least one of the two.

Another problem is whether your XSLT processor is able to handle any of these. I admire XT but it still lacks support for anything but UTF-8 in the output; SAXON is much more foreigner-friendly ;-).

Please note that UTF-8/Unicode, windows-1251, koi8-r and iso-8859-5 are all mutually incompatible. If you were about to publish Cyrillic texts in Russian over the Internet, I would recommend using koi8-r. Bulgarian uses the same repertory of glyphs, but I don't know which is the preferred charset; they could also have a fifth version ;-).

29.

Copyright character

David Carlisle

> Why saxon is not allowing this character?.

Not really saxon, it's the XML parser you are using with Saxon.

What do you mean by `this character'? If you mean unicode character xA9 (copyright) then you can always refer to that as &#xA9; in any encoding.

but if you want to refer to it just using character data you need to be using an encoding that includes the copyright character and then use whatever bytes are required in that encoding.

In utf-8 (the default xml encoding) character 169 is encoded as two bytes, alternatively you could encode your XML file in latin 1 encoding and then copyright is encoded as one byte, © , but then you need to start the xml file <?xml version="1.0" encoding="iso-8859-1"?>

You can always use &#xA9; in place of characters in XML character data in elements or attribute values. You just can't use it in XML names. So by using latin 1 you don't restrict the characters you can use in data, but you do restrict the names you can use for elements and attributes to just using latin 1 characters.

so <xsl:variable name="x" select="'&#xA9;'"/> defines $x
to be the _single_ character (string-length($x) is 1)

30.

Multiple special characters in XML

Zsolt Czinkos

My XML documents look like this:

<?xml version="1.0"?>
<!DOCTYPE doc [
<!ENTITY % SpecChars SYSTEM "spechars.ent">
%SpecChars;
]>
<doc>
</doc>

The "spechars.ent" file contains the entity declarations or further references. I downloaded entity declaration files from: schema.net

To which Wendell Piez added

The SYSTEM identifier takes a URI, so it could be a relative path. Any entities declared in spechars.ent will be parsed in the document.

Be warned, however, that a validating parser will complain that your DTD lacks declarations for elements. But a parser checking only for well-formedness (such as the one in most XSL processors) should pass it fine.

31.

National characters

David Carlisle


> is any simple method to convert the national
> characters Å Ä Ü Ö and such to a format that the
> XML parser understands?

Most likely you don't need to convert anything. Just tell the parser you are using Latin 1: stick this

<?xml version="1.0" encoding="iso-8859-1"?>

at the top

32.

Certain chars break transformation process.

Mike Brown


> Where have I gone wrong?  Whenever I try to perform an XSLT transformation 
> on an XML file which has the registered symbol (®), the XSLT programs read 
> the symbol as two characters (? and ?)
    

Your XML file is using an encoding that is different from what the XML parser (which feeds info about the document to the XSLT processor) thinks it has. Is there an encoding="..." specification in the <?xml ...?> line at the beginning of the file? What does it say? What created the file? Was it a simple text editor that didn't give you the option of selecting an encoding/character set to use?

Most likely what you see as the circle-R in your editor is stored on disk as a single byte, 0xAE, which is how that character is represented in iso-8859-1 and cp1252. The XML parser is following rules outlined in the XML spec for deciding what character set was used to encode that XML, and is decoding the bytes accordingly. It is probably deciding utf-8 is the character set that was used.

You must change the XML file to declare the correct encoding, as it is an error for a document to declare the wrong encoding. However you should note that XML parsers are only required to support utf-8 and utf-16, so it may be necessary to check your parser's documentation to see whether it supports the actual encoding of the document.

Aside from that, choose one:

- Change the encoding of the XML file using a tool like Free Recode so that it matches what the file declares its encoding to be.
- Change the XML file to use character references instead of literal characters, for characters that are outside the ASCII range (0x20-0x7E), for example &#xAE;. (Note that such references are for the ISO/IEC 10646-1 universal character set, commonly though not accurately thought of as Unicode.) This way your XML file will be entirely ASCII bytes, and since ASCII is a subset of UTF-8, it will be fine if the parser interprets it as UTF-8.
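
Both the failure and the character-reference fix can be reproduced in a few lines of Python. This is a minimal sketch, assuming the file holds the single byte 0xAE; the sample text "Acme" and the variable names are made up for the illustration.

```python
# Sketch: a Latin-1 byte misread as UTF-8, and the character-reference fix.
raw = b"Acme\xae"  # 0xAE is the registered sign in iso-8859-1 and cp1252

text = raw.decode("iso-8859-1")  # correct when the file really is Latin 1

try:
    raw.decode("utf-8")          # what a parser assuming UTF-8 would attempt
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False              # a lone 0xAE is not a valid UTF-8 sequence

# Replace every non-ASCII character with a numeric character reference.
ascii_only = text.encode("ascii", "xmlcharrefreplace")
print(ascii_only)
```

Note that the xmlcharrefreplace handler only deals with non-ASCII characters; it does not escape & or <, which still need the usual XML escaping.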

33.

Converting quotes

Michael Kay


> <xsl:when test="contains($string, '"')">
    

An XML attribute value can't contain both double and single quotes as literal characters; whichever one delimits the attribute must be escaped. I generally try to avoid the complexities of XML escaping by using

<xsl:variable name="quot">"</xsl:variable>
<xsl:when test="contains($string, $quot)">

34.

Special characters

Mike Brown


> An ID element in the XML may have '&' chars in it and I
> can't seem to find the way to get it properly escaped.  For example the
> actual data may look like:
> 
> MARK & SCAN
> 
> and the XML for the ID element that I send to the browser looks like:
> 
> <ID>MARK&#x20;&#x26;&#x20;SCAN</ID>

...when this XML is parsed, it becomes the character data MARK & SCAN. That's all the XSLT processor will get to see. Your character references are just to disambiguate the character data from the markup, for the parser's benefit. You weren't actually making the character data be MARK&#x20;&#x26;&#x20;SCAN.

> In the XSL code, I simply try to build the URI as follows (not everything is
> here...)
> 
> <a>
>   <xsl:attribute name="href">
>     <xsl:value-of 
>  select="$Server"/>?ID=<xsl:value-of select="ID"/>&amp;APP=
> etc...
>   </xsl:attribute>                  
>   <xsl:value-of select="ID"/>
> </a> 

Probably easier to do something like

 <xsl:value-of select="concat($Server,'?ID=',ID,'&amp;APP=',etc)"/>
or
 <a href="{$Server}?ID={ID}&amp;APP={etc}">click here</a>

> I have <xsl:output method="html"/> set, and when I 
> copy the link into the
> address field on the browser, I see that the spaces have converted, but not
> the & char.  The URI will look like:
> 
> http://server?ID=A%20&%20B&APP= etc... 
> with the & still explicitly there
> hosing my URI.  I'm sure I'm missing something simple.

You are getting exactly what you asked for, but...

- what you see in the status bar when you mouse over the link in IE is the URI derived from what text & markup is actually in the HTML. i.e., &amp; in the document will become & in the URI.

if you want to see the actual HTML that was generated, use the 'view XSL output' context menu addition that you can download from MS. There's a link to it in the MSXML FAQ.

my testing indicates that the href comes out exactly as expected: <a href="http://www.skew.org/printenv.cgi?ID=MARK &amp; SCAN&amp;APP=foo">

- what you asked for is not what you wanted, because to be a proper URI the '&' in the first parameter must be '%26' and the spaces must be either '%20' or '+'. you didn't ask for these, so that's why you didn't get them.

Your real problem is that URL encoding is not a standard feature of XSLT/XPath. On Nov 17 I posted a way to use Java's built-in URL encoder via an extension function. That may not do you any good if you're using MSXML. Anyone have any tips for doing URL encoding with MSXML?

If you are reasonably sure about your input data, you could probably get away with just escaping a few characters that you expect to see, by using translate() to convert spaces to plus signs, and/or something like the stylesheet on my webpages for the rest.
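
To make the two layers concrete, here is a small Python sketch of what a URL encoder does to the value before it goes into the query string, followed by the separate markup escaping. It is not an XSLT solution; the server name and the APP value are placeholders invented for the example.

```python
from html import escape
from urllib.parse import quote, quote_plus

value = "MARK & SCAN"

# Layer 1: URL escaping of the *value*, so its "&" is data, not a separator.
percent_encoded = quote(value, safe="")
plus_encoded = quote_plus(value)  # same idea, with spaces written as "+"

# Hypothetical server and second parameter, for illustration only.
uri = "http://server?ID=" + percent_encoded + "&APP=foo"

# Layer 2: markup escaping of the whole URI when it is written into HTML/XML.
href_in_markup = escape(uri, quote=True)

print(uri)
print(href_in_markup)
```

The separator "&" stays a plain ampersand in the URI and only becomes &amp; when the URI is written into an attribute.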

35.

Using Unicode entities in XSLT

Mike Brown

In your XML you must declare what the 'hellip' entity is. This entity is not predefined in XML. I am surprised you can even get it to parse without having this entity defined.

Add

<!DOCTYPE text [ <!ENTITY hellip "&#x2026;"> ]>

to your XML.

When the XML parser in Instant Saxon reads the &hellip; it will report the horizontal ellipsis character, not the string of 8 characters & # x 2 0 2 6 ;, nor the string of 8 characters & h e l l i p ;.

If you need the full set of SGML entities, get http://www.oasis-open.org/cover/xml-ISOents.txt and reference it like

<!DOCTYPE text SYSTEM "xml-ISOents.txt">

or pull it into your real DTD if you already have one.
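
The effect of the internal declaration can be observed with any XML parser. The following Python sketch uses the standard library's ElementTree rather than Instant Saxon, so take it as an analogy; the sample text "wait" is invented.

```python
import xml.etree.ElementTree as ET

# With the entity declared in the internal DTD subset, the parser reports
# the single ellipsis character in place of the reference.
declared = (
    '<!DOCTYPE text [ <!ENTITY hellip "&#x2026;"> ]>'
    "<text>wait&hellip;</text>"
)
root = ET.fromstring(declared)
content = root.text

# Without the declaration the document is not well-formed.
try:
    ET.fromstring("<text>wait&hellip;</text>")
    undeclared_parses = True
except ET.ParseError:
    undeclared_parses = False

print(len(content), undeclared_parses)
```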

36.

International Characters and browsers

Mike Brown


> we have had problems with the encoding set in our
> documents [affecting] what was displayed in the browser
	

This might have to do with whether the browser has been properly configured to detect and use the encoding that you set. You also do not want to just make up encodings; the document has only one encoding and if you declare it to be something that it is not, it is only natural that the browser will misinterpret it.

Since you said you are fuzzy on encoding...

The encoding that we are talking about here is the mapping of characters (which are abstract) to sequences of bits (which are.. less abstract). Strictly speaking, this is a "character encoding scheme".

Take, for example, the non-breaking space character, which in HTML we often write as "&nbsp;", a predefined (in HTML, not XML) entity reference defined as equivalent to "&#160;", which in turn is interpreted as the single non-breaking-space character. Different encoding schemes will represent this character as different bit sequences.

For example, in the "iso-8859-1" encoding, the non-breaking space character maps to the bit sequence 10100000, an 8-bit byte representing a value that we can also easily express as decimal 160 or hex A0. But in "utf-8", the non-breaking space maps to the bit sequence 11000010 10100000. If we interpret this as a pair of 8-bit bytes, we could say they represent the values hex C2 followed by A0 (194 and 160).

Now imagine you are the web browser, receiving an HTTP message containing an HTML document. All you see in the message is a stream of bits. How do you know what 11000010 10100000 means?

If you think the document is encoded using utf-8, you'll correctly interpret this sequence as one single NO-BREAK SPACE character (that's its Unicode name).

If you think the document is encoded using iso-8859-1, you will incorrectly interpret it as *two* characters: (0xC2) LATIN CAPITAL LETTER A WITH CIRCUMFLEX followed by (0xA0) NO-BREAK SPACE.
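
The two readings of that bit sequence can be reproduced directly. A minimal Python sketch, with variable names chosen only for this example:

```python
import unicodedata

# The two bytes from the example, written as bit patterns.
wire_bytes = bytes([0b11000010, 0b10100000])

as_utf8 = wire_bytes.decode("utf-8")         # one character
as_latin1 = wire_bytes.decode("iso-8859-1")  # two characters

utf8_names = [unicodedata.name(ch) for ch in as_utf8]
latin1_names = [unicodedata.name(ch) for ch in as_latin1]

print(utf8_names)
print(latin1_names)
```

The bytes are identical in both cases; only the decoding assumption changes.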

Where do you get info about the document's encoding? Well, there are 3 places to get it:

from the transport (e.g., one of the HTTP message headers)
from within the document itself (e.g., assume the document is us-ascii encoded, read until you find a META tag that is intended to mean the same thing as the HTTP header from the first option, then reprocess the document using whatever encoding was declared there); or
by analyzing the bit sequences in the document and making an educated guess

The first option is supposed to take highest precedence. I believe that in the case of HTML documents, the second option has higher precedence, in practice, even though it is in violation of the relevant specs for it to do so.

The last option is difficult, but browsers will make a stab at it if properly configured. XML makes this option much more feasible in XML parsers than HTML does for HTML user agents because you know an XML document always begins with the bits for "<?xml ", possibly preceded by a UTF-16 byte order mark, and if it doesn't then the parser is required to assume it is UTF-8 or UTF-16 (and it is an error if the document is not!).
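
To give an idea of why XML makes the guessing feasible, here is a deliberately simplified Python sketch of such a sniffer. The function name is hypothetical, and it ignores cases a real parser must handle (UTF-16 without a byte order mark, UTF-32, EBCDIC).

```python
import codecs
import re

# Matches encoding="..." inside a leading XML declaration (ASCII-compatible input).
_DECLARATION = re.compile(
    rb"""<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z0-9._-]+)["']"""
)

def sniff_xml_encoding(data: bytes) -> str:
    """Guess an XML document's encoding from its first bytes (simplified)."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    match = _DECLARATION.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No byte order mark and no declaration: fall back to the XML default.
    return "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><a/>'))
```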

If you use the first or second options, you have to be sure that the encoding being declared is accurate. If you saved your document to disk from a text editor, it exists on disk as (essentially) a sequence of bits, so it must have been subjected to some encoding. Your editor might have given you the option of choosing this encoding. If it didn't, then it probably stored it using the encoding that is your operating system's default, which can vary depending on the OS and locale (e.g. windows-1252, aka cp1252, on USA versions of Windows). You must declare this encoding, or an encoding that is a superset of it, to be the encoding of the document.

So let's say you have in your HTML document a declaration of the encoding that was used, and that this declaration is accurate. Whether or not the browser will actually honor this declaration and decode it appropriately is an entirely separate matter!

You will find that in many cases, if you have not gone to the trouble of configuring your browser to auto-detect the encoding, it will proceed under the assumption that the document is in some default encoding that it shipped with, or the one that your operating system uses by default.

Above and beyond this, most browsers give you the option of manually resetting the encoding while you are viewing a page, which really means you are choosing to *decode* the document's bit sequences according to that particular scheme.

I saw a post on the Unicode list, I believe, from Microsoft explaining that in one of their 3.x browsers they either had auto-detect on by default, or they didn't allow users to override the encoding.. and this resulted in innumerable complaints from people who could no longer view web pages with misdeclared encodings. Apparently there are a lot of Shift-JIS documents out there claiming to be ISO-8859-1, and people needed to be able to make the browser ignore these misdeclarations by default so their surfing experience wouldn't be traumatic.

FWIW, http://www.hclrss.demon.co.uk/unicode/ contains some good info that compares encoding support in various browsers.

I'll also point you to 2 articles I wrote recently (Feb 2001): i18n and XML vs. HTTP

37.

braces in result tree

Michael Kay


> I want to output
> <a style="{font: 10pt arial}" href="/path/page.jsp">click</a>
>
	

The curly braces denote an attribute value template; the contents are treated as an XPath expression. If you want the generated HTML attribute to contain curly braces, double them:

<a style="{{font: 10pt arial}}" 
	  href="/path/page.jsp">click</a>

38.

CJK UTF-16 test

Mike Brown

Mike, slowly becoming the XSLT list character guru!

> XML does NOT support UTF-16 since UTF-16 includes the surrogates

Wow, strike that from the archives, because it's dead wrong.

XML is specified in terms of sequences of allowable ISO/IEC 10646-1 characters, not particular binary-encoded representations of those characters.

>    [2]   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>                   [#x10000-#x10FFFF]

These are characters, not UTF-16 bytes.

In ISO/IEC 10646-1 and Unicode _there is no character_ at code point 0xD800.

And in a UTF-16 encoded document, the bit sequence that I would write in hex as D800 (big endian) or 00D8 (little) is not a character. The *sequence* D800 DC00 (big) represents character #x10000, which I write here using the same notation as the EBNF excerpt you quoted from the XML spec.

If you were to say that an XML document can contain a "character" #xD800 then you would a.) be in violation of the definition of character from ISO/IEC 10646-1 (which XML relies on), and b.) have no way of representing that character in a UTF-16 encoded document, because by definition, D800 in UTF-16 is the first half of a surrogate pair, not a character...
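
Python's own string model follows the same distinction, so it can serve as a quick sanity check. A minimal sketch, with variable names invented for the example:

```python
# Character #x10000 is one character, stored in UTF-16 as a surrogate pair.
first_supplementary = "\U00010000"
utf16_be = first_supplementary.encode("utf-16-be")

# A lone surrogate code point is not a character and cannot be encoded.
try:
    "\ud800".encode("utf-16-be")
    lone_surrogate_encodes = True
except UnicodeEncodeError:
    lone_surrogate_encodes = False

print(len(first_supplementary), utf16_be.hex(), lone_surrogate_encodes)
```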

Michael Beddow comes in with the moderator's view.

I think this misunderstanding is mostly down to an unfortunate turn of phrase in the comment on the relevant production rule in the spec.

/* any Unicode character, excluding the surrogate blocks ... */

I can see why people assume that "excluding" *qualifies* the phrase "any Unicode character", whereas in fact it simply *explains* it, since values in the surrogate blocks are by definition NOT Unicode characters, though they may be part of a sequence that represents such a character.

39.

Ampersand for URLs

Mike Brown

URIs identify resources by name (URN) or location (URL). A URI is just a sequence of ASCII characters, always. You can use the following characters freely:

a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
0 1 2 3 4 5 6 7 8 9
- _ . ! ~ * ' ( )

In addition, you can use the following characters, but they are earmarked as "reserved". Exactly what their reserved purpose is depends on the scheme, i.e. URIs that begin with "http" use the HTTP scheme and the meaning of the reserved characters is defined in the HTTP spec.

; / ? : @ & = + $ , 

You can also use %, but only to indicate the beginning of an escape sequence of the form %XX where XX are 2 characters that form a hexadecimal number from 00 to, theoretically, FF, although the meaning of anything above 7F (the upper limit of ASCII) is questionable.

The reason you would want to use a %XX escape sequence is so you can
- represent a reserved character being used for something other than its reserved purpose
- represent the other ASCII characters that exist but that are disallowed in URIs, like %20 for the space character, %22 for the double quote, etc.

Therefore, "%26" and "&" in a URI are *not* equivalent, because "&" is a reserved character... "%26" means just a "&" as if it were not a reserved character. This is very similar to the concept of using "&lt;" instead of "<" in XML where you want to say "<"-as-character-data-not-markup.

*pulls nail out of coffin*

In an HTTP URL that is being used to submit HTML form data, "&" in the query part of the URL has the reserved purpose of being a separator between each name-value pair for each form control. "&" is your only option for writing this separator. "%26" is what you write if you want to make the name or value in the form data contain an ampersand character. Make sense? That's why your server gave you an error when you changed "&" to "%26".

To illustrate,

http://foo/bar?field1=hello&field2=world means that field1 has the value "hello" and field2 has the value "world".

http://foo/bar?field1=hello%26field2=world contains no separator at all, so it is a single name-value pair: most form-data parsers will read it as field1 having the value "hello&field2=world", and there is no field2.
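
Python's standard query-string parser, for one, splits on the literal "&" first and treats everything after the first "=" of a pair as the value. A short sketch of what it reports for the two query strings (other server-side parsers may differ in details):

```python
from urllib.parse import parse_qs

# "&" as a separator: two fields.
with_separator = parse_qs("field1=hello&field2=world")

# "%26" as data: no separator is present, so only one field is found.
with_escape = parse_qs("field1=hello%26field2=world")

print(with_separator)
print(with_escape)
```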

> Both IE5.5 and NS 4.73 on Windows accepted a plain '&' 
> in the url string in
> the <a> element - they didn't require the &amp; .

They do this to retain compatibility with malformed HTML documents.

Since HTML doesn't have a concept of well-formedness, document authors get away with using "&" to mean something other than the start of an entity or character reference.

It does not follow that &amp; is not the correct way to do it, nor does it follow that IE and NS won't accept it if you do it that way.

To further clarify, even though it has been stated in this thread several times already,

http://foo/bar?field1=hello&field2=world is the URI.

If you want to write this in an XML or HTML document, you need to write http://foo/bar?field1=hello&amp;field2=world or http://foo/bar?field1=hello&#38;field2=world

HTML browsers will accept http://foo/bar?field1=hello&field2=world even though they shouldn't.

The thing you have to keep in mind is that the document is interpreted before it is acted on. This is true for XML and HTML. When <a href="http://foo/bar?field1=hello&amp;field2=world"> is in the HTML, it is interpreted to mean

an "a" element with an "href" attribute having value "http://foo/bar?field1=hello&field2=world"

and this is later used to construct an anchor that, when activated, will take the user to http://foo/bar?field1=hello&field2=world.

Given that this is the same in XML/XSLT as it is for HTML, I don't see why so many people are thrown by it. They assume that the href value is copied verbatim and used as the literal URL that will appear in the Address/Location section of their browser.
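
The "interpreted before it is acted on" step is easy to watch with an XML parser. A minimal Python sketch using the standard library; the link text "go" is invented:

```python
import xml.etree.ElementTree as ET

markup = '<a href="http://foo/bar?field1=hello&amp;field2=world">go</a>'
link = ET.fromstring(markup)

# What the application sees after parsing: the URI, with a plain "&".
href = link.get("href")

# Writing the element back out escapes the "&" again.
serialized = ET.tostring(link, encoding="unicode")

print(href)
print(serialized)
```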

I wrote a bit about using non-ASCII characters in URIs at my website

40.

Fallback for unsupported entities

Michael Westbay

I output in iso-2022-jp, and none of those entities work well. My solution was to override their use like this:

  <xsl:template name="dingbat.characters">
    <xsl:param name="dingbat">bullet</xsl:param>
    <xsl:choose>
      <xsl:when test="$dingbat='bullet'">o</xsl:when>
      <xsl:when test="$dingbat='copyright'">(C)</xsl:when>
      <xsl:when test="$dingbat='trademark'">(TM)</xsl:when>
      <xsl:when test="$dingbat='trade'">(TM)</xsl:when>
      <xsl:when test="$dingbat='registered'">(R)</xsl:when>
      <xsl:when test="$dingbat='service'">(SM)</xsl:when>
      <xsl:when test="$dingbat='nbsp'">&amp;nbsp;</xsl:when>
      <xsl:when test="$dingbat='ldquo'">"</xsl:when>
      <xsl:when test="$dingbat='rdquo'">"</xsl:when>
      <xsl:when test="$dingbat='lsquo'">`</xsl:when>
      <xsl:when test="$dingbat='rsquo'">'</xsl:when>
      <xsl:when test="$dingbat='em-dash'">-</xsl:when>
      <xsl:when test="$dingbat='mdash'">-</xsl:when>
      <xsl:when test="$dingbat='en-dash'">-</xsl:when>
      <xsl:when test="$dingbat='ndash'">-</xsl:when>
      <xsl:otherwise>
        <xsl:text>o</xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

Even when the browser did understand some of the entities, the character encoding took precedence and caused most to escape to double byte characters.

41.

Unicode fonts

Mike Brown



 > 1.  Is there a standard font for unicode and where can I get it? (what
 > do you use?)
 

If you have MS Office, you can pop in the CD and go to the installer, do a custom install and choose to add what I think it calls the 'international font' or some such.. this will install Arial Unicode MS.

Alternatively, you can get it from microsoft.com

I think there's a license issue with this, so you can't redistribute it, and you are expected to own MS Publisher 2000 (though it's not actually required to use the font).

In theory, you shouldn't need a font that covers all of Unicode, if you have the coverage you need in different fonts. Generally a font is tuned for a particular writing system ('script') like Latin, Arabic, etc., and only has coverage for that script plus a little extra. Again in theory, the OS or an application can peek into the font files and know the character coverage for each font, so when the preferred font doesn't have a certain character, an alternate font from the same family can be used for just that one character. IE 5.0 and above for Windows does provide for this to some extent.

The info at unicode.org may also be of interest to you.

42.

Special characters

Mike Brown


> I am aware of very few character entities like &lt; &gt; etc.
> where can i get  details of those

How about the XML 1.0 specification? But the list is very short:


  &lt;   <
  &gt;   >
  &amp;  &
  &quot; "
  &apos; '

These are the only built-in character entities. You never actually need &gt;, and you rarely need &quot; or &apos; except in attribute values that use the same kind of quotes as delimiters around the value. Examples:


  <foo bar="&quot;hello world&quot; isn't what I said"/>
  <foo bar='"hello world" isn&apos;t what I said'/>
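
If you generate XML from a program, a library routine will usually pick the escapes for you. As a sketch, Python's xml.sax.saxutils behaves like this (the sample strings are made up, apart from the attribute text taken from the example above):

```python
from xml.sax.saxutils import escape, quoteattr

# Character data: only &, < and > are escaped by default.
text = escape("a < b & c > d")

# Attribute values: quoteattr adds the delimiters and escapes only what
# clashes with them.
attr = quoteattr('"hello world" isn\'t what I said')

print(text)
print(attr)
```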

You can load up your XML document's DTD with more entities if you like. A standard set of SGML entities (the ones used by HTML) can be obtained at http://www.oasis-open.org/cover/xml-ISOents.txt

So for example, in your stylesheet, if you really want to use &nbsp;, and typing numeric character references like &#xA0; or &#160; or the literal character (alt-0160 on WinNT,2K,XP) is too difficult for some reason, then you can put


  <!DOCTYPE xsl:stylesheet SYSTEM "xml-ISOents.txt">

in the document. (and don't forget to put standalone="no" in your XML declaration).

If typing numeric character references like &#1234; is easier (I would say it certainly is, and is guaranteed to work), then you can get a list of what numbers correspond to what characters by looking at the Unicode charts at unicode.org

Also try macchiato.com, but note that it will only show you printable characters that your browser is capable of displaying. On the plus side it shows you UTF-8 and UTF-16 code value sequences.

In Unicode there are currently 95221 non-private characters assigned numbers between 0 and 917631 (the range goes up to 1114111, hex 10FFFF), so don't expect it all to fit on a chart that you can put in your pocket. It's worth buying the Unicode Standard (hardcover book) if you need such a reference.

43.

Encoding problems, win32 platforms

Tom Passin



 > It would depend on the User Agent, not the platform. If this is actually
 > true for any "recent" version of IE (let's say, since 4.0), I'd like to
 > see some evidence before I believe it :-)

I just did an experiment that verified what each of us said. I created an xml file on my Windows 2000 machine with a &#174; in it. I transformed it with an identity transform twice, first with encoding='utf-8' and second with encoding='iso-8859-1'.

Looking at the hex bytes, the iso results contained a hex AE byte, which is correct for character 174. The utf-8 results contained the two hex bytes C2 AE, which I presume is right for utf-8. Both results displayed the registered trademark symbol, the one with the r in a circle.

I copied the results to a floppy and took it over to my Win95/SP2 computer, then displayed the results in IE 5.5. Both files displayed the same, showing the right symbol. This is what you said would happen.

I also loaded each result into Notepad on Win95. Notepad displayed the iso file correctly, but not the utf-8 result (it showed an "A" character with a little circle above it, ahead of the trademark symbol). This is what I was suggesting would happen. BTW, Notepad on the Win2000 computer did display both results correctly.

Summarizing, what you will see displayed for high-order characters can depend on the encoding, OS, and the viewing program. On older versions of Windows, at least, non-browsers are likely to display the wrong thing.

In fact, even on my Win2000 machine, using XML Cooktop to run and display the transformation gave an incorrect display (and it uses the IE activeX control to display the results!), so you can't be sure even on Win2000 that high order characters will display the intended way, depending on the app.

Try it yourself on your system. Here are the files:


 <?xml version='1.0' encoding='utf-8'?>
 <data>Here is a ==&#174;== character</data>
 
 <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output encoding='utf-8'/><!--Or change to iso-8859-1-->
 
 <!-- Identity transformation template -->
 <xsl:template match='*|@*'>
  <xsl:copy>
   <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
 </xsl:template>
 
 </xsl:stylesheet>

44.

Managing the &quote; entity

Wendell Piez


>Does anybody nows why &quote; is so hard to handle?

Because you have to get it past *both* the XML parser (which the escaping of " into &quot; will do) and the XPath expression parser.

E.g. <xsl:value-of select='translate(.,"&quot;","-")'/>

the XML parser resolves this to

translate(.,""","-")

which doesn't make much sense in XPath.

The usual workaround is to bind the problem character (or string) to a variable, so you can evaluate the expression

translate(.,$quote,"-")

which isn't a problem.
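
The first layer, what the XML parser hands over, can be inspected with any parser. This Python sketch uses the predefined &quot; entity and a namespace-free stand-in element (value-of without the xsl: prefix) purely to keep the example short:

```python
import xml.etree.ElementTree as ET

# The attribute as written in the stylesheet source.
source = """<value-of select='translate(.,"&quot;","-")'/>"""

# The attribute value after XML parsing: this is the string the XPath
# parser is asked to compile.
xpath_seen = ET.fromstring(source).get("select")

print(xpath_seen)
```

The result has three double quotes in a row, which is why the XPath parser cannot make sense of it.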

Make more sense now? This double layer of indirection for "hot" characters is part of the price we pay for the expression language wrapped in XML syntax. If XPath had a completely different set of significant characters (delimiters) from XML -- things would be confusing in a different way.

45.

Ampersand in uri content

David Carlisle



  The problem is with the query string of the url (the bit after the ?).
  Afaik, name-value pairs need to separated ampersands, such as:

  ?one=value1&two=value2&three=value3

That's what the URL has, but like any string, to put that into an XML file or HTML file you have to quote the &.

Some legacy browsers try to be kind by not enforcing that you quote the & but that's not kind, only confusing.

If you type the above into a location/address bar it has to be as above, but if you type it into a src attribute it has to be


?one=value1&amp;two=value2&amp;three=value3

Otherwise the document it is in is not valid (or not well formed).


  I know that url escaping allows for %26 to be used instead of ampersands
 - but apparently not for the separator, this needs to be an actual
  ampersand.

Yes, you don't want url escaping here, that would be used to get a & _into_ the value rather than be a separator.



  What is the usual technique to create the query string in xsl?

Just go with the flow and let XSLT force you into doing the right thing.

To summarise,

If you type:


?one=value1&amp;two=value2&amp;three=value3

directly into IE's address bar, it will fail. However, you can safely use it for the src attribute:

<img src="http://1.2.3.4?img=foo.gif&amp;test=true"/> 

46.

AVT escaping

Abel Braaksma



What is the cleanest way to code the following javascript code in my XSL:

<a href="#" onclick="Position.clone('ElementA', 'ElementB', {setHeight:
false, setWidth: false})">-click-</a>

I think you want to know how to escape the { and } characters in an Attribute Value Template, correct? You can do so by using {{ and }} respectively:

<a href="#" onclick="Position.clone('{$ElementA}', '{$ElementB}',
{{setHeight: false, setWidth: false }})">-click me-</a>
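
If the stylesheet itself is produced by a program, the doubling can be automated. A tiny Python sketch; the helper name is hypothetical, and it must only be applied to the literal parts of the attribute, never to the {$...} expressions you want evaluated:

```python
def escape_avt_braces(literal: str) -> str:
    """Double the braces in literal text destined for an attribute value template."""
    return literal.replace("{", "{{").replace("}", "}}")

js_options = "{setHeight: false, setWidth: false}"
avt_safe = escape_avt_braces(js_options)

print(avt_safe)
```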

47.

Getting the Base Character from a Character with Diacritic

Abel Braaksma


> Is there a way in xslt for me to get the base character of a character
> with diacritic?
> Like Âàto Ẫ? I was thinking of using the translate function, but it
> there are too many characters to include.

This, I think, is far from trivial, but here's a possible approach you might want to consider.

First of all, a character with diacritic does not necessarily need to be encoded as one character. For instance, the character Ẫ can be encoded as x1EAA, as x0041x0302x0303 (A + ^ + ~) or as x00C2x0303 (Â + ~), where the ~ and ^ are not the characters on your keyboard, but the combining diacritical marks. Also, depending on how you look at it, Æ (x00C6) equals (is compatible with) AE.

To normalize this Unicode stuff, the Unicode Consortium has invented four (or five?) normalization algorithms. These infamous algorithms either try to decompose (D) as much as possible, or compose (C) as much as possible. Furthermore, they can try to (de)compose even further to the compatible (K) counterparts of the characters. This gives four variants of Normalization Forms: NFC, NFKC, NFD, NFKD (see unicode.org). The XPath 2.0 function normalize-unicode is used for dealing with this and also adds a fifth variant: fully-normalized, which is NFC plus not starting with a combining character.

That's for the theory. Now practice. Processors only need to support one normalization form, namely NFC. This is needed so you can correctly compare two strings. What you need is NFKD (or, to a lesser extent, NFD). Then you can strip these diacritical marks with translate() (x0302, x0300, x0301, x0303, x0309, x0323, x0304 for a start: circumflex, grave, acute, tilde, hook, dot, macron) and you have a string without diacritical marks.

Not sure how you intend to deal with Chinese, Hangul, Hebrew, Cyrillic and some other more complex languages, because the combining characters are an important part of the letter, without which they mean nothing. This depends on what you are actually after.

Abel later corrected himself, explaining

In a way, I thought that the Modifier characters (x02B0 - x02FF) could also be used for "modifying" a certain character (I mentioned the macron and circumflex in an earlier post as 0x02C9 and 0x02C6, but these were wrong). They do include macron, diaeresis, circumflex etc. but as I understand now, these are not used for "modifying/combining letters" but for "modifying spacing" (i.e: quotes etc), and as a result do not influence normalization.
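
The difference between a spacing modifier letter and a combining mark shows up in the Unicode character database. A small Python sketch using the standard unicodedata module; the variable names are only for this example:

```python
import unicodedata

spacing = "\u02c6"    # MODIFIER LETTER CIRCUMFLEX ACCENT
combining = "\u0302"  # COMBINING CIRCUMFLEX ACCENT

# Canonical combining class: 0 means "not a combining mark".
spacing_class = unicodedata.combining(spacing)
combining_class = unicodedata.combining(combining)

# Only the combining mark takes part in composition under NFC.
with_spacing = unicodedata.normalize("NFC", "A" + spacing)
with_combining = unicodedata.normalize("NFC", "A" + combining)
stacked = unicodedata.normalize("NFC", "A\u0302\u0303")

print(spacing_class, combining_class, len(with_spacing), len(with_combining), len(stacked))
```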

Colin Adams adds

The nearest thing I can think of is (XSLT 2.0 only) to call the normalize-unicode function specifying NFKD as the normalization form.

You will then have a string where the base characters and the accents are adjacent characters.

You could then write an xsl:function to drop non-ASCII characters

MK follows with

Following up on suggestions from others, if NFKD is supported then the following should work reasonably well for European languages:

replace(normalize-unicode($in, 'NFKD'), '[&#x0300;-&#x036F;]', '')

or if you prefer

codepoints-to-string(string-to-codepoints(normalize-unicode($in,
'NFKD'))[not(. = 768 to 879)])
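
For comparison, or for pre-processing outside an XSLT 2.0 processor, the same idea can be sketched in Python: decompose with NFKD, then drop code points 768 to 879 (x0300 to x036F). The function name is made up for the example.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose with NFKD, then drop the Combining Diacritical Marks block."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not 0x0300 <= ord(ch) <= 0x036F)

print(strip_diacritics("\u1eaa"))
print(strip_diacritics("\u00c5str\u00f6m"))
```

Characters with no decomposition, such as Æ, pass through unchanged, so this is no more a complete transliteration than the XPath version.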

Abel closes with

I thought of using the following as regular expression, when you find Michael Kay's solution sufficient. It does about the same, but I consider it slightly more readable (opinions may vary ;-)

replace($yourtext,
'(\p{IsCombiningDiacriticalMarks}|\p{IsCombiningMarksforSymbols})', '')

The following is probably the most-complete definition of chars that can be used for combining different characters. Just for the sake of completeness, I copy it here (note: this is Unicode 3.1, because the XML spec only requires Unicode 3.1 conformance. Codes added in 3.2 and 4.0, like x034B-034F, are not in this list):

CombiningChar       ::=       [#x0300-#x0345] | [#x0360-#x0361] |
[#x0483-#x0486] | [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] |
#x05BF | [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670 |
[#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8] |
[#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C] | #x094D |
[#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983] | #x09BC | #x09BE |
#x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 |
[#x09E2-#x09E3] | #x0A02 | #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] |
[#x0A47-#x0A48] | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] |
#x0ABC | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] |
[#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] |
[#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] |
[#x0BC6-#x0BC8] | [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] |
[#x0C3E-#x0C44] | [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] |
[#x0C82-#x0C83] | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] |
[#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] |
[#x0D4A-#x0D4D] | #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] |
#x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] |
[#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F |
[#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 |
[#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 |
[#x302A-#x302F] | #x3099 | #x309A

In general: getting the diacritics out is easy with the normalize-unicode function and/or the string-to-codepoints/codepoints-to-string conversion (alternatively, you can use a regular expression with one of the named classes).