XML entities, XSLT entities

Entities

1. How can I pass entities through to the output of a transform
2. Lost with entities
3. Entity sets available
5. How can I output an nbsp within a template
6. Can I have a DOCTYPE without having a DTD
7. Unknown entities problem
8. Resolving entities
9. Using Entity References in XSL Templates
10. Character entities appear as garbage
11. Viewing entities in XSLT output
12. Entities definitions not available
13. ISO Entity sets
14. Redeclaring built-in entities
15. Unparsed entities and public identifiers
16. Passing entities through a transform
17. Entities as numeric references?
18. How to get a Numerical Character Reference in the output?
19. Entity definitions
20. Entities not found
21. entity problems, character sets and encodings.
22. Keeping entities through a transform
23. Switching off character entity resolution in XSL
24. Numerical character reference pass-through
25. Recognising entities
26. How can I pass entities through to the output of a transform

1.

How can I pass entities through to the output of a transform

Andrew Welch


Bottom line, you can't. 

But! By capturing input events and modifying the source, the entity information can be passed through to the output for further processing. See this for the detail

Converting cdata sections into markup
Preserving entity references
Preserving the doctype declaration
Marking up comments

Quite versatile.

2.

Lost with entities

David Carlisle



  I don't understand anything,

  I tried with this entity <!DOCTYPE
  xsl:stylesheet [<!ENTITY euro "&#8364;">]> and with &#x20AC;

Don't complicate issues by introducing doctypes and entity declarations. If you get it all working and afterwards you want to make "&euro;" a macro for whatever works, then you can do that, but it is irrelevant to your problem.

The way to get a euro is to write &#8364; or &#x20AC;, which are equivalent.

That puts Unicode character 8364 into your document.

You then transform your document however you like with xslt to produce a result document.

You then (perhaps) serialise that result tree to a sequence of bytes to form an XML file in some encoding, and then you pass this document to some system such as a browser to display the document.

Starting from the end.

* If you haven't got a font installed which has a euro symbol then that symbol is not going to display whatever you do. (This may sound like stating the obvious but it's probably one of the more common reasons for problems in this area.)

* If you have not got a font (or font system) set up which knows that it should map the unicode slot 8364 to a font that has a euro in it then again you are going to see some kind of missing glyph marker.

Both of those problems are really out of scope for this list, and hard to debug at a distance anyway.

Your document may be written out in some encoding (e.g. the encoding specified in xsl:output, or a default such as UTF-8) but your browser may be expecting a different encoding (e.g. a default such as Latin 1). This problem, which can be solved by fixing your browser to recognise the encoding, or avoided by getting XSLT to write in the encoding expected by your browser, was the root cause of the long thread about "nbsp" earlier in the week.

Note it's very hard to tell if this is your problem, partly because if you just put bytes representing characters into your mail they may be changed to other bytes representing (hopefully) the same character in a different encoding by the time the person reads it.

For example, the line of your message


  when I change this manually I see exactly the "EUR" character.

I see as

  when I change this manually I see exactly the "\200" character.

although I can tell that the character shown on my screen as \200 is a single character. I expect, though, that you saw something else. It is octal 200, which is hex 80, which is a control character and not a printable character at all. How did that get there?

Are you producing HTML or something else? Can you hand write an HTML file that contains a euro that displays correctly?

If the answer to both of those is yes, then we can probably tell you what to do in XSLT to generate HTML of the form that works when you write it by hand.
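
A small sketch in Python (used here only for illustration) of the points above: both references name the same character, and the bytes written for it depend on the encoding. The Windows-1252 line is only one possible explanation for a stray \200 byte; that the mail passed through that code page is an assumption.

```python
# Both numeric forms name the same Unicode character, the euro sign.
euro = chr(8364)
assert euro == chr(0x20AC) == "\u20ac"

# The same character becomes different bytes in different encodings.
utf8_bytes = euro.encode("utf-8")      # three bytes
cp1252_bytes = euro.encode("cp1252")   # one byte, 0x80 (assumed code page)

# Latin 1 has no euro, so it cannot be written there as a raw byte.
try:
    euro.encode("latin-1")
    latin1_has_euro = True
except UnicodeEncodeError:
    latin1_has_euro = False

print(utf8_bytes, cp1252_bytes, oct(cp1252_bytes[0]), latin1_has_euro)
```

In Unicode itself, code 0x80 is a control character, which matches the observation above.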

3.

Entity sets available

David Carlisle

There is now a collection of entity definitions hosted at the W3C at w3c.org

This is a mixture of an update to the existing data at http://www.w3.org/Math/characters, with most of the text rephrased to be less MathML-specific, together with a new draft text of an update to ISO/IEC TR 9573 to include Unicode (ISO 10646) definitions of the entities rather than SGML SDATA entities.

All the data and scripts to produce the site are also available, linked from the overview page.

Currently the definitions are identical to the definitions in the forthcoming MathML 2 2nd edition PR draft.

The draft ISO/IEC DTR 9573 contains tables detailing differences between these current definitions and the definitions used by DocBook, HTML, and the STIX Consortium.

It is hoped that this _draft_ set of definitions might form the basis of a shared, compatible set of definitions between different XML languages, so that the current situation, where <mo>&asymp;</mo> changes meaning if it is copied from a docbook+mathml document to an xhtml+mathml document, might be avoided...

Comments welcome ( on xml-dev, or on www-math@w3.org).

5.

How can I output an nbsp within a template

Tony Graham and others.

Use &#xA0; (or &#160;). The code value in Unicode for NO-BREAK SPACE is U+00A0. Alternatively, define &nbsp; by including it in a DOCTYPE declaration at the start of the stylesheet:

<!DOCTYPE xsl:stylesheet [<!ENTITY nbsp "&#160;">]>

Or you could just use &#160; within the template, but obviously that is less readable.
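
A quick check of both routes, sketched in Python with the standard library parser standing in for whatever parser reads the stylesheet (an assumption; the element name r is made up for the example):

```python
import unicodedata
import xml.etree.ElementTree as ET

# &#xA0; and &#160; are the same code point, U+00A0.
assert 0xA0 == 160
nbsp_name = unicodedata.name(chr(0xA0))

# An entity declared in the internal subset expands to that character.
doc = '<!DOCTYPE r [<!ENTITY nbsp "&#160;">]><r>a&nbsp;b</r>'
text = ET.fromstring(doc).text

print(nbsp_name, ascii(text))
```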

6.

Can I have a DOCTYPE without having a DTD

G. Ken Holman

>As I understand it, entities have to be declared inside the DOCTYPE. But 
>my stylesheets don't have a DTD. Can I have a DOCTYPE without having a DTD?

It isn't a problem to have a DOCTYPE without a DTD.

When I work with the text output method, I usually create a &nl; entity to emit line feeds in my output as follows:

<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
<!ENTITY nl "&#xd;&#xa;">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version="1.0">

<xsl:output method="text"/>

<xsl:template match="/">
   <xsl:text>&nl;</xsl:text>
</xsl:template>

</xsl:stylesheet>

You can also have partial DTDs ... a common XSLT processing requirement for well-formed documents is to recognize which attribute of elements is of type ID (because there is nothing special conferred on attributes that are *named* ID).

To communicate this information regarding the source file to the stylesheet, the following can be added to the source:

<!DOCTYPE custsummary [
<!ATTLIST cust custNbr ID #REQUIRED>
]>
<custsummary>
......

Alexandra Morgan adds:

When using external entities, the URL must contain forward slashes.

E.g. 
<?xml version='1.0'?>
 <!DOCTYPE xsl:stylesheet [
 <!ENTITY header SYSTEM "http://localhost/header.xml">
etc.

            

7.

Unknown entities problem

Sebastian Rahtz

Q expansion: I've got an XML file, that may contain weird Unicode entities (such as &laqno;). Of course the parser crashes because my DTD only contains the most usual Unicode entities.

Has anyone a smarter idea than building a DTD with all Unicodes?

http://www.tug.org/applications/jadetex/unicode.xml contains everything that I have ever discovered, from which you can extract what you want. The real claim to fame of this monster is that it contains all the MathML characters (all recent changes to this file come from David Carlisle, using it for MathML).

David Carlisle added:

If you mean an exhaustive list of the characters in the Unicode character set (which is what the &#xxxx; character references refer to), you can find that at: <URL: unicode.org>. In particular grab a copy of UnicodeData-Latest.txt from the Unicode site.

It's a pretty invaluable file (usually to be found in one of my emacs buffers :-)

The XSL below will extract an XML-compatible entity file (or files) from unicode.xml; just edit it to get the sets you want. As posted it just makes one for ISOPUB from ISO 8879. It uses the xt:document extension for XT. Information on how to make it work with other XSL engines gratefully received. (Vendor-neutral extension namespace, perhaps? :-)


 <?xml version="1.0"?>
 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 xmlns:xt="http://www.jclark.com/xt"
                 extension-element-prefixes="xt"
                 version="1.0">
 
 
 <xsl:output
   method="text"
   />
 
 <xsl:template name="alphadecl">
 <xsl:param name="set"/>
 <xsl:variable name="x">
   <xsl:choose>
   <xsl:when test="starts-with($set,'9')">
     <xsl:value-of select="substring-after($set,'13-')"/>
   </xsl:when>
   <xsl:when test="starts-with($set,'8')">
     <xsl:value-of select="substring-after($set,'-')"/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:value-of select="$set"/>
   </xsl:otherwise>
   </xsl:choose>
 </xsl:variable>
 
 <xt:document method="text"  href="{$x}.ent">
   <xsl:for-each select="character/entity[@set=$set]">
     <xsl:sort select="@id"/>
     <xsl:text>&lt;!ENTITY </xsl:text>
     <xsl:value-of  select="@id"/>
     <xsl:call-template name="pad">
       <xsl:with-param
                name="x"
               select="15-string-length(@id)-string-length(string(../@dec))"/>
     </xsl:call-template>
     <xsl:text> "&amp;#</xsl:text>
     <xsl:if test="60 = ../@dec or 38 = ../@dec">
       <xsl:text>38;#</xsl:text>
     </xsl:if>
     <xsl:value-of select="../@dec"/>
     <xsl:text>;" &gt;&lt;!--</xsl:text>
     <xsl:value-of  select="../@id"/>
     <xsl:text> </xsl:text>
     <xsl:value-of select="desc"/>
     <xsl:text> --&gt;
</xsl:text>
   </xsl:for-each>
 </xt:document>
 
 </xsl:template>
 
 
 <xsl:template name="pad">
 <xsl:param name="x"/>
 <xsl:if test="$x &gt; 0">
 <xsl:text> </xsl:text>
 <xsl:call-template name="pad">
 <xsl:with-param name="x" select="$x - 1"/>
 </xsl:call-template>
 </xsl:if>
 </xsl:template>
 
 <xsl:template match="charlist">
 
 
 
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isoamsa'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isoamsb'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isoamsc'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isoamso'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isoamsr'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isogrk3'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isomfrk'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isomopf'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isomscr'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'9573-13-isotech'"/>
   </xsl:call-template>
 
 
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isobox'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isocyr1'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isocyr2'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isodia'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isogrk1'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isogrk2'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isolat1'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isolat2'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isonum'"/>
   </xsl:call-template>
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'8879-isopub'"/>
   </xsl:call-template>
 
 
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'mmlextra'"/>
   </xsl:call-template>
 
   <xsl:call-template name="alphadecl">
     <xsl:with-param name="set" select="'mmlalias'"/>
   </xsl:call-template>
 
 </xsl:template>
 </xsl:stylesheet>
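
For anyone without XT to hand, the core of what the stylesheet writes can be sketched in Python. This is only an illustration: the function name, the sample list and the padding width are assumptions, and it skips the comment that the stylesheet appends to each line.

```python
def entity_decl(name: str, codepoint: int, width: int = 15) -> str:
    """Return one <!ENTITY ...> line for a character entity."""
    # '&' (38) and '<' (60) need the extra &#38; level so that the
    # replacement text is itself a character reference.
    prefix = "&#38;#" if codepoint in (38, 60) else "&#"
    return f'<!ENTITY {name.ljust(width)} "{prefix}{codepoint};">'

# Hypothetical sample data; unicode.xml is the real source of names and codes.
sample = [("amp", 38), ("lt", 60), ("nbsp", 160), ("euro", 8364)]
decls = [entity_decl(name, cp) for name, cp in sorted(sample)]
print("\n".join(decls))
```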

Rick Geimer answered:

You could just include the entity files from MathML in your DTD, since they contain Unicode mappings for just about all the old ISO SGML entity sets. You can download the entities from the W3C at the following URL: http://www.w3.org/TR/REC-MathML/mmlents.zip

8.

Resolving entities

Mike Brown


> does anyone know of any simple way to 
> resolve &, <, >, or other entity
> characters inside of an xml document?
> 
> for example:
> 
> <Company>
>      <Name>Foobar & Goobar</Name>
> </Company>

That wouldn't be entity resolution. You have to have an entity before you can resolve it. What you want is called "cleanup" :) Do it by hand. There's no easy way for a parser to read the document and know for sure whether an instance of the &, <, or > characters is intended to be just that character, or markup. That's why you should make sure that any data you're putting into PCDATA has those characters properly encoded as entity references.
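
A minimal Python sketch of the difference, assuming the data can be escaped at the point where it is written into the document (the variable names are made up for the example):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_name = "Foobar & Goobar"

# Pasting the raw text into markup gives a document that is not well formed.
try:
    ET.fromstring(f"<Name>{raw_name}</Name>")
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# Escaping when the data is written avoids the problem.
safe = f"<Name>{escape(raw_name)}</Name>"
round_trip = ET.fromstring(safe).text

print(raw_parses, safe, round_trip)
```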

9.

Using Entity References in XSL Templates

Ray Arjun

Please note that _character reference_ and _entity reference_ are distinct categories. In particular, the former is not an "entity" at all.

> ... to refer exclusively to code positions in the document's character
> set, while named entities refer to character positions in either
> ISO-8859-1 or UCS, depending on which entity you're referring to.

Not quite. In the entity declarations, the entities are defined in terms of character references. That's OK, because the connection between ISO-8859-1/UCS and the document character set is determined by the SGML declaration.

> In HTML, &nbsp; is always ISO-8859-1 character 
> number 160, i.e. a non-breaking space ... 
> but &#160; is simply character number 160 in the
> character set of the document encoding, 

Oops. Not at all. Absolutely not. The encoding has absolutely nothing, repeat **NOTHING** to do with this. The I18n spec is required reading:

(It's the job of the "entity manager" to transcode from the encoding to the document character set.)

These references may be helpful also: Mulberry jkorplea schars
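
The same point can be checked with an XML parser (an assumption: the thread is about SGML and HTML, and Python's standard XML parser is used here only as a convenient stand-in). The literal character is stored as different bytes in the two files, but the character reference means the same thing in both.

```python
import xml.etree.ElementTree as ET

# A character reference, a separator, then the literal character.
body = "<r>&#233;|\u00e9</r>"

latin1_doc = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")

# The two bodies differ on disk...
assert body.encode("iso-8859-1") != body.encode("utf-8")

# ...but once decoded, reference and literal are the same character in both.
latin1_text = ET.fromstring(latin1_doc).text
utf8_text = ET.fromstring(utf8_doc).text

print(ascii(latin1_text), ascii(utf8_text))
```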

10.

Character entities appear as garbage

David Carlisle


> With the output method set to "xml", however, references 
> to certain character entities result in garbage

Are you sure it is garbage? Most likely it is just that you are looking at the file with a terminal or editor expecting Latin 1 encoding, but the default encoding for XML is UTF-8.

11.

Viewing entities in XSLT output

Mike Brown

> All the &nbsp; from the original XML document are translated by XT into unexpected characters.

The non-breaking space characters are being serialized with UTF-8 encoding. The non-breaking space character, U+00A0, is encoded as 2 bytes. Whatever you are using to view the document is not decoding it properly and is showing the 2 bytes as 2 characters.

This could indicate something wrong in your stylesheet if you were expecting to see a character or entity reference. Are you using the text output method, by chance?

Nikolai Grigoriev adds:

XT produces only UTF-8 and ignores encoding specifiers in xsl:output. For every &nbsp; you get C2 A0, the UTF-8 representation of &#160;.

Use SAXON and specify encoding="ISO-8859-1" in xsl:output if you want to get it readable ;-).
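
The byte-level effect can be reproduced in Python. Note the assumption: Python's ElementTree serializer stands in for the XSLT processor here, so this shows the encodings, not XT or Saxon themselves.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<p>a&#160;b</p>")

utf8_out = ET.tostring(elem, encoding="utf-8")
latin1_out = ET.tostring(elem, encoding="iso-8859-1")

# In UTF-8 the no-break space is the two bytes C2 A0.
assert b"a\xc2\xa0b" in utf8_out
# Written as Latin 1 it is the single byte A0.
assert b"a\xa0b" in latin1_out

# A viewer that assumes Latin 1 shows the two UTF-8 bytes as two characters.
misread = b"\xc2\xa0".decode("iso-8859-1")

print(utf8_out, latin1_out, ascii(misread))
```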

12.

Entities definitions not available

David Carlisle

I'm processing a source document (file1) which has its own doctype, entities defined, etc. Midstream I use document() to retrieve and process another XML file (file2). The doctype and local entities I have defined for file1 are completely dissociated from those of file2. And any entities I define within the stylesheet are only 'valid for use' within the stylesheet? So on this basis I have three document structures, file1, file2 and the stylesheet, each with its own, non-overlapping set of entity definitions.

The point is that document() supplies a node set and similarly the original document and the stylesheet produce node sets (in each case a set of one node, the root node of the respective documents) but to produce a node set you have to parse the things as XML 1.0 documents and that means that the scope of entity declarations has to be as described in XML 1.0. If you have the file

<foo>
&xxx;
</foo>

Then as it stands it is not well-formed XML, and nothing the XSL spec says about document() could change that: the file won't get past an XML parser.

To make it well formed you either have to put a DOCTYPE at the top of the file which declares xxx, or you can include this file in another XML document (that does have such a DOCTYPE), but as an external parsed entity, not as an XML document. External parsed entities, i.e. something declared via <!ENTITY filewithxxx SYSTEM "aaa.xml"> and referenced as &filewithxxx;, are expanded by the XML parser as part of building the node tree for XSL, so such an inclusion is invisible to XSL: it is as if the included file had been copied into the including file.
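
A short Python sketch of the scoping rule, with two strings standing in for file2 (the entity value and the helper name are made up for the example): each document is parsed on its own, so a declaration made anywhere else is of no help.

```python
import xml.etree.ElementTree as ET

# Stand-in for file2: it references xxx but never declares it.
file2_without_doctype = "<foo>&xxx;</foo>"
# The same file carrying its own declaration.
file2_with_doctype = '<!DOCTYPE foo [<!ENTITY xxx "some text">]><foo>&xxx;</foo>'

def parses(doc: str) -> bool:
    """Return True if doc gets past the XML parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

expanded = ET.fromstring(file2_with_doctype).text
print(parses(file2_without_doctype), parses(file2_with_doctype), expanded)
```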

13.

ISO Entity sets

Rick Geimer

Try the excellent mapping provided by Sebastian Rahtz and others at this URL: (SR tells us) in case it is not obvious, the former is derived from the latter entirely automatically.

Mike Brown adds Oasis

Numeric character references are derived from the code values in ISO 10646-1:1993 as prescribed by the XML 1.0 Recommendation, but you would do better to obtain a relatively affordable copy of The Unicode Standard Version 3.0, which is a printed book available through Amazon or any other computer book shop.

Online you will find that the references at macchiato and Unicode Consortium are most helpful in this regard. I also recommend for Windows 9x/NT users, (a better Character Map) in conjunction with the Bitstream Cyberbit font.

As for entity references like &amp;, there are only 5 of them in XML and they are listed in the XML 1.0 Recommendation. They are &amp; &lt; &gt; &quot; &apos; for & < > " ' respectively. All other entities have to be declared in the document's DTD.

14.

Redeclaring built-in entities

Nikolai Grigoriev

The MS parser chokes on the following entity declarations :

<!ENTITY amp    "&#38;">
<!ENTITY lt     "&#60;">

However, I have re-read the XML spec and discovered a special section on how to redeclare the entities (4.6 "Predefined entities"). I realized that I was wrong in copying the declarations blindly from the HTML 4.0 spec. It should be double-escaped like this:

<!ENTITY amp    "&#38;#38;">
<!ENTITY lt     "&#38;#60;">

With this correction, everything works as it should.

What is strange is that neither XT nor Saxon reported that error?

Mike Kay adds: It depends of course on which XML parser you are using with XT or Saxon, not on the XSLT processor itself.

I suspect an XML parser should report this as an error. The XML version of the XML specification itself, however, contains the line

<!ENTITY lt     "&lt;">

and most parsers accept it: but IE5 doesn't. I think MS are in the right.

the <xsl:output> element only provides hints to the XSLT processor; and the processor can ignore the hints. And if it does ignore them, you can choose a different XSLT processor.

David C finishes the (confusing) picture by adding: replacement texts of entities are only required to be well formed if they are referenced, strange but true. (xml spec section 2.1, point 3)

It is arguable that the declaration of lt as "&lt;" in the xmlspec does not break this rule, as the predefined entity takes precedence. That's what happens with a Web SGML system, where lt declared in the SGML declaration for XML acts as if at the beginning of the local subset. The XML spec is rather vague on whether predefined entities count for the `first definition wins' rule that applies to normal entity declarations.

The declaration of lt as "&lt;" contravenes the prose constraint that lt must be declared to be a reference to the &lt; character, but that is not, as such, a well formedness constraint.
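
The double-escaped form can be tried against Python's standard parser, which is just one parser among many (an assumption for the example). The result is the same whether a parser honours the redeclaration or falls back on its built-in definition, so this shows only that the form is accepted.

```python
import xml.etree.ElementTree as ET

doc = (
    "<!DOCTYPE r [\n"
    '<!ENTITY amp "&#38;#38;">\n'
    '<!ENTITY lt  "&#38;#60;">\n'
    "]>\n"
    "<r>a &lt; b &amp; c</r>"
)

text = ET.fromstring(doc).text
print(text)
```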

15.

Unparsed entities and public identifiers

Q expansion. The XSLT Recommendation (Section 3.3, Unparsed Entities) contains the following phrase: "The XSLT processor may use the public identifier to generate a URI for the entity instead of the URI specified in the system identifier." Do any of the extant XSLT processors take advantage of this opportunity?

Paul Grosso

One obvious approach would be a command-line option to specify an OASIS Catalog file, mapping public to system identifiers. Public identifiers could be looked up in this file, with the system identifier being used (as it must) if no match is found for the public identifier.

It's relatively feasible (for a Java programmer) to add catalog support to something like XT. See below for more.

>One obvious approach would be a command-line option to specify an OASIS
>Catalog file, mapping public to system identifiers.  Public identifiers
>could be looked up in this file, with the system identifier being used
>(as it must) if no match is found for the public identifier.

I agree, which is why I've been pushing catalog support whenever I can.

As you mention, the obvious choice is the catalog support defined by the OASIS (SGML Open) Entity Management Catalog Technical Resolution TR9401:1997 [1]. See Norm's article[2] for both a good description of the issues and for a pointer to some Java classes that implement both the TR9401:1997 catalogs and XML Catalogs [3].
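
The lookup rule itself is simple. Here is a toy Python sketch of it; the dictionary, the identifiers and the function name are all hypothetical, and a real implementation would read the mapping from a catalog file in the TR9401 format rather than hard-coding it.

```python
from typing import Optional

# Hypothetical catalog contents: public identifier -> local system identifier.
CATALOG = {
    "-//EXAMPLE//DTD Sample 1.0//EN": "file:///usr/local/dtd/sample.dtd",
}

def resolve(public_id: Optional[str], system_id: str) -> str:
    """Prefer a catalog match on the public identifier, else the system identifier."""
    if public_id is not None and public_id in CATALOG:
        return CATALOG[public_id]
    return system_id

print(resolve("-//EXAMPLE//DTD Sample 1.0//EN", "http://example.org/sample.dtd"))
print(resolve("-//EXAMPLE//DTD Other//EN", "http://example.org/other.dtd"))
```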

16.

Passing entities through a transform

David Carlisle (who else :-)

XSLT requires all entities to be expanded. If you don't need to query into the replacement texts of the entities and want to use XSLT, you need to use something that removes the &'s. Here's a script I use:

 #!/bin/bash
 sed -e "s/&/XXXAMPXXX/g" -e "s/DOCTYPE\\([^<]*\\)>/--DOCTYPE \\1-->/" $1 > $1.tmp
 ../xsl/saxon.sh -o $1.tmp2 $1.tmp ../xsl/enum.xsl
 sed -e "s/XXXAMPXXX/\&/g" -e "s/--DOCTYPE *\\([^-]*\\)--/DOCTYPE \\1/" $1.tmp2 > $1
 rm $1.tmp $1.tmp2
 

It comments out the DTD reference, makes all entity references simple text, runs Saxon, then uncomments the DTD reference and puts the entity references back.
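
The ampersand half of the trick, sketched in Python for illustration. The placeholder is assumed never to occur in the real text, renaming the element is a stand-in for the actual transform, and the DOCTYPE handling is left out.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER = "XXXAMPXXX"  # assumed not to occur in the real text

def protect(source: str) -> str:
    """Turn every & into a placeholder so entity references become plain text."""
    return source.replace("&", PLACEHOLDER)

def restore(result: str) -> str:
    """Put the ampersands back after the transform."""
    return result.replace(PLACEHOLDER, "&")

source = "<p>caf&eacute; &amp; bar</p>"
tree = ET.fromstring(protect(source))   # parses: no entity references are left
tree.tag = "para"                       # stand-in for the real transform
output = restore(ET.tostring(tree, encoding="unicode"))

print(output)
```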

17.

Entities as numeric references?

Mike Kay


> Is there any possibility to let the numerical references pass
> transparently?

XML tries - not always successfully - to separate the logical information content from the physical encoding of the file. Certain features of XML are considered to carry information that applications can legitimately use (for example, the names and values of elements and attributes, the order of elements), and other features are considered to be accidents of the encoding (for example, the order of attributes, the whitespace within a start tag, the choice of single or double quotes).

There are some boundary cases such as CDATA sections, comments, and namespace prefixes, where different specs have interpreted the constructs differently in this respect. But the distinction between representing a character literally and representing it by a decimal or hexadecimal character reference is definitely in the "accident of encoding" category, and no application should rely on it being represented one way or the other. Indeed, XML parsers should not even tell the application which way it was encoded.

XSLT works strictly at the level of the logical information content: it is not concerned with retaining the original encoding, though it does allow you some limited control over the output encoding if you use the XSLT serializer.

You haven't really explained why your target application (the one that is the recipient of the output from the transformation) cares which XML representation of the characters in the document is chosen. As David explained, if it does care, then it is not an XML application as defined by the XML specification.
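
A small Python check of the "accident of encoding" point, using the standard library parser as an example of what an application sees (the element name is made up):

```python
import xml.etree.ElementTree as ET

literal = ET.fromstring("<r>\u00e9</r>").text
decimal = ET.fromstring("<r>&#233;</r>").text
hexadecimal = ET.fromstring("<r>&#xE9;</r>").text

# The application receives the same string in all three cases and has
# no way of telling which form the source file used.
same = literal == decimal == hexadecimal
print(same, ascii(literal))
```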

18.

How to get a Numerical Character Reference in the output?

Mike Brown



> In the DocBook input, I have &#8593; (with spaces in case it doesn't go 
> through: & # 8593 ;). This should be an [arrow up].

> I'm transforming the doc to XHTML. In the output, I get the upwards 
> arrow (as one char (â+'), not as NCR), which should be fine so far, but 
> the browsers (Mozilla etc) don't like it: They display garbled stuff 
> like â + '(spaces inserted).

Sounds like you got UTF-8 output but your browsers think it's iso-8859-1.

Either add <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> to the XHTML document, so the browser will interpret the bytes properly, or change the encoding of the output: <xsl:output method="xml" encoding="us-ascii"/> will force numeric character references for all the non-ASCII characters.
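
Both halves of that diagnosis can be reproduced in Python, with ElementTree's serializer standing in for the XSLT processor and Windows-1252 standing in for what the browser assumed (both are assumptions for the example):

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring("<p>&#8593;</p>")   # U+2191 UPWARDS ARROW

utf8_out = ET.tostring(elem, encoding="utf-8")
ascii_out = ET.tostring(elem, encoding="us-ascii")

# What a viewer assuming a Latin 1 style code page makes of the UTF-8 bytes.
misread = "\u2191".encode("utf-8").decode("cp1252")

print(utf8_out, ascii_out, ascii(misread))
```

With the ASCII encoding the serializer has no way to write the arrow directly, so it falls back on a numeric character reference.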

19.

Entity definitions

David Carlisle



You could use Sebastian Rahtz's very useful unicode.xml doc at Robins pages (more up-to-date versions may be floating around).

A version up to date for Unicode 3.2 (and a set of entity files derived from it) is available at w3c in particular w3c

(I've sort of stolen unicode.xml from sebastian these days)


> [2] This won't work with hashed character reference entities, 
> as I doubt you
> can redefine them.

This is true. They're not entity references and you can't (re)define them.

20.

Entities not found

Brook Ellingwood

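You can't use an HTML entity in XML unless the document's DTD declares it. You should use the equivalent numeric character reference instead, which is based on the Unicode character set (not on a particular encoding such as UTF-16). This URL is convenient for finding the common character codes: demon.co.uk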

This URL has far greater scope if you ever need to find a less common code: alanwood.net

In theory, if you are using a Unicode-aware text editor, you can set the encoding of your file to UTF-16 (or UTF-8) and just type the characters in without using the entities, but I've had pretty lousy luck getting that to work all the way to the browser. For now, you're safer to use the entities.
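
If the HTML names are all you have, the numeric form can be looked up mechanically. A sketch in Python, using the HTML entity table from its standard library (the function name is made up for the example):

```python
from html.entities import name2codepoint

def html_entity_to_ncr(name: str) -> str:
    """Return the numeric character reference for an HTML entity name."""
    return f"&#{name2codepoint[name]};"

for name in ("nbsp", "eacute", "euro"):
    print(f"&{name};", "->", html_entity_to_ncr(name))
```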

21.

entity problems, character sets and encodings.

Eliot Kimber



> If you are using the UTF-8  encoding, for example, the ó character is 
> represented by &#243;

Actually, the encoding doesn't matter--what matters is the character set, which is always Unicode for XML.

That is, the character &#243; (Latin small letter o with acute) is that character in the Unicode character set regardless of the encoding.

Characters are an abstraction. A character set is nothing more than an arbitrary mapping of abstract characters to unique numbers by which those characters can be referenced. In Unicode, each character also has a unique name that can be used instead of the character code to refer to the character (although not all processors know how to resolve these names).

The encoding simply determines how the characters are written to disk as sequences of bytes. For example, in the UTF-8 encoding this character is written as two bytes (UTF-8 uses a single byte only for codes below 128, two bytes up to U+07FF, and three or more beyond that), and in the UTF-16 encoding it is also written as two bytes, because UTF-16 uses two bytes for each character of the Unicode Basic Multilingual Plane (the first 65K or so codes). The two byte sequences are different, but in both cases the character (the abstraction) is the same: lowercase o with acute.

To read an XML file, the XML processor must first read the sequence of bytes on disk and then interpret that byte sequence as a sequence of characters. Therefore, it must know the encoding because the same sequence of bytes may result in different characters (or be invalid) depending on the encoding it is interpreted as.
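
The byte sequences are easy to inspect in Python (used here only as a calculator; ISO-8859-1 is added for comparison):

```python
import unicodedata

ch = "\u00f3"   # the character referred to by &#243;
assert ord(ch) == 243

utf8 = ch.encode("utf-8")          # two bytes
utf16 = ch.encode("utf-16-be")     # two bytes, written without a byte order mark
latin1 = ch.encode("iso-8859-1")   # one byte

# Different bytes on disk, the same abstract character once decoded.
decoded = {utf8.decode("utf-8"), utf16.decode("utf-16-be"), latin1.decode("iso-8859-1")}

print(utf8, utf16, latin1, unicodedata.name(ch))
```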

22.

Keeping entities through a transform

Richard Light




>Unlike what most people would use XSL for (i.e. conversion of XML to 
>HTML or other output format), I have a requirement to transform from 
>one XML structure to another (subsequent presentation rendering 
>occurring way downstream). No big deal I guess, but the annoying thing 
>here is that by the time an XML parser has done its job as per the XML 
>specification, all those pesky character entities have been resolved 
>(as defined in the DTD for the source document) and the output contains
>square brackets.

I've done this for entities which map to character references, rather than to the SGML-style "SDATA" strings you quote below. My strategy is to live with the fact that the parser has carried out all the entity mappings, and to use a "mappings" document containing entries like this:

<char>
<name>Delta</name>
<value>&#x0394;</value>
<unicode>0394</unicode>
<!--U0394 /Delta capital Delta, Greek --> 
</char>

to reverse the process on output. (For your purposes, all you need is the <name> and <value> elements - the other element types have different uses.)

Essentially, when you come to output text(), iterate through it character by character. Have a convenience variable $normal-chars, e.g.:

<xsl:variable name="normal-chars"
          select="concat('ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                  'abcdefghijklmnopqrstuvwxyz',
                  '0123456789 ',
                  '!$%^*()-_+={}[];:@#~/?.,')"/>

so you can quickly test and output characters which can be output as found. For all others, look up the <char> with the appropriate <value>, and output its <name>:

    <xsl:value-of select="concat('&amp;', $ch-name, ';')"
                  disable-output-escaping="yes"/>

Yes, it is slow and clumsy, and yes, it does use the deprecated disable-output-escaping, but it does work ...
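The character-by-character strategy described above can be sketched outside XSLT as well. The following Python version is only an illustration under stated assumptions: the function name restore_entities and the two-entry CHAR_NAMES table are hypothetical stand-ins for the "mappings" document, and the fallback to a hexadecimal character reference for unmapped characters is an addition not described in the post.

```python
# Characters that can be output as found (same set as $normal-chars).
NORMAL_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789 "
    "!$%^*()-_+={}[];:@#~/?.,"
)

# Hypothetical stand-in for the mappings document: <value> -> <name>.
CHAR_NAMES = {"\u0394": "Delta", "\u20ac": "euro"}


def restore_entities(text, names=CHAR_NAMES):
    """Rebuild entity references for characters that have a known name."""
    out = []
    for ch in text:
        if ch in NORMAL_CHARS:
            out.append(ch)                     # output as found
        elif ch in names:
            out.append("&" + names[ch] + ";")  # named entity reference
        else:
            # Assumption: fall back to a numeric character reference.
            out.append(f"&#x{ord(ch):04X};")
    return "".join(out)


print(restore_entities("\u0394x = 1"))
```

The dictionary lookup plays the role of finding the <char> whose <value> matches; the output is only meaningful if the downstream document declares the named entities.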

23.

Switching off character entity resolution in XSL

David Carlisle



  I realise there is some way of dealing with this with character
  substitutions before or after using something like sed, but this isn't
  really a great solution,

Great or not, it's what we all do:-)

sed -e 's/&/[[[AMP]]]/g' file.xml > tmp.xml
saxon -o tmp2.xml tmp.xml style.xsl
sed -e 's/\[\[\[AMP\]\]\]/\&/g' tmp2.xml > result.xml
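The same protect-and-restore round trip can be sketched in Python. This is a minimal illustration, assuming the placeholder string never occurs in the real data; renaming the root element stands in for the actual transformation, and the helper names protect and restore are hypothetical.

```python
import xml.etree.ElementTree as ET

PLACEHOLDER = "[[[AMP]]]"  # assumed not to occur anywhere in the source


def protect(xml_text):
    """Hide every ampersand so the parser never sees an entity reference."""
    return xml_text.replace("&", PLACEHOLDER)


def restore(xml_text):
    """Put the ampersands back after the transformation."""
    return xml_text.replace(PLACEHOLDER, "&")


source = "<p>caf&eacute; &amp; bar</p>"

root = ET.fromstring(protect(source))  # parses fine: no references left
root.tag = "para"                      # stand-in for the real transform
result = restore(ET.tostring(root, encoding="unicode"))

print(result)
```

Note that the undeclared &eacute; would make the unprotected source unparseable here, which is exactly the situation the trick works around.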

XSLT 2.0 provides character maps (xsl:character-map), which allow characters to be written back out as entity references or anything else.

For XSLT 1.0 a system is _allowed_ to do that but probably won't unless it has some mechanism for supplying a non-standard serialiser. Many of the Java ones do have such a facility, I think.

24.

Numerical character reference pass-through

Tom Passin



> Is it reasonable to expect 
> &#160; to be left as &#160; when outputting us-ascii.
> This is obviously the same as expecting the two consecutive bytes 0xC2 
> 0xA0 in a utf-8 input stream to be translated to &#160; for a us-ascii 
> output stream.  I don't see anything wrong with such an expectation.

What you haven't absorbed yet is that there is nothing to be "left as &#160;". When the XML parser parses the XML file and encounters such a character reference, it sticks a Unicode non-breaking space into the text that it constructs. Later, the XSLT processor receives that text and has no way to know that it started out in life as &#160;, &nbsp;, or whatever it was.
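A quick way to convince yourself of this is to parse the different spellings and compare what the application receives. The sketch below uses Python's standard ElementTree parser rather than an XSLT processor, and declares nbsp in an internal subset purely for the example.

```python
import xml.etree.ElementTree as ET

# Four spellings of the same character in the source document.
as_decimal = ET.fromstring("<p>&#160;</p>").text
as_hex = ET.fromstring("<p>&#xA0;</p>").text
as_entity = ET.fromstring(
    '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>&nbsp;</p>'
).text
as_literal = ET.fromstring("<p>\u00a0</p>").text

# After parsing, all four are the single character U+00A0;
# nothing records which spelling was used.
print([ascii(s) for s in (as_decimal, as_hex, as_entity, as_literal)])
```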

So when the XSLT transformation is done and the result gets handed off to the serializer, there is no memory of the original source of that character (and how could there be, really - after all, you might have added the character during the transformation). So the serializer has to figure out what to output and, if the encoding does not support a character, how to escape it. The easiest and most reliable thing to do is output an &#xNN; or &#xNNNN; character reference, so that is what you usually get.

The above is what is "reasonable" to expect - output that character if the encoding supports it, output an appropriate character reference if not, and oh, yes, if you are lucky maybe output the corresponding html entity or maybe not. Any other behavior is nonconforming and incorrect.
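The serializer's side of the story can be sketched the same way. This Python example is an illustration, not an XSLT serializer: it uses the standard "xmlcharrefreplace" error handler and ElementTree's tostring, both of which happen to emit decimal references.

```python
import xml.etree.ElementTree as ET

text = "a\u00a0b"  # contains a non-breaking space

# Plain string encoding: the character is written directly if the
# encoding supports it, otherwise as a numeric character reference.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")
utf8_bytes = text.encode("utf-8")

# The same decision made by an XML serializer.
root = ET.Element("p")
root.text = text
ascii_xml = ET.tostring(root, encoding="us-ascii")
utf8_xml = ET.tostring(root, encoding="utf-8")

print(ascii_bytes, utf8_bytes, ascii_xml, utf8_xml)
```

In both cases the choice depends only on the output encoding, never on how the character was spelled in the input.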

25.

Recognising entities

David Carlisle


> is there a way to recognize and filter only elements which
> text() begins with a character entity?
>
>   <Element>&amp;This is filtered</Element>
>   <Element>&epsilon;This filtered</Element>
>   <Element>&euro;This is filtered</Element>
>   <Element>This is *not* filtered</Element>
>   <Element>And this also not</Element>

Entities are resolved by an XML parser before XSLT sees the data. So no.

If you need to do that, you need to supply a modified DTD in which, for example, &epsilon; is defined to be <entity name="epsilon"/>; then you can look for that element.
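The modified-DTD trick can be tried out with any parser that expands internal entities. The following Python sketch assumes an internal subset is acceptable for the demonstration; the wrapper element Root and the helper starts_with_entity are hypothetical names introduced here.

```python
import xml.etree.ElementTree as ET

# Entities are redefined to expand to a marker element instead of a character.
doc = """<!DOCTYPE Root [
<!ENTITY epsilon "<entity name='epsilon'/>">
<!ENTITY euro "<entity name='euro'/>">
]>
<Root>
<Element>&epsilon;This filtered</Element>
<Element>&euro;This is filtered</Element>
<Element>This is *not* filtered</Element>
</Root>"""


def starts_with_entity(el):
    """True if the element's content begins with an <entity/> marker."""
    no_leading_text = not el.text
    return no_leading_text and len(el) > 0 and el[0].tag == "entity"


root = ET.fromstring(doc)
flags = [starts_with_entity(el) for el in root.findall("Element")]
names = [el[0].get("name") for el in root.findall("Element") if starts_with_entity(el)]

print(flags, names)
```

In a stylesheet the equivalent test would look at whether the first child node of the element is the marker element. The predefined &amp; cannot be redeclared to produce an element this way, so it is left out of the sketch.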

MK Adds

Technically these are all "entity references", not "character entities". If you wrote "&#x20ac;", that would be a "character reference".

26.

How can I pass entities through to the output of a transform

Andrew Welch


Bottom line, you can't. 

But! By capturing input events and modifying the source the entity information can be put through to the output for further processing. See this for the detail

Converting cdata sections into markup
Preserving entity references
Preserving the doctype declaration
Marking up comments

Quite versatile.