Guide to managing special characters from an XML source via XSLT, including: Ampersand, left angle bracket, how to verify display, UTF-8 guidelines, Additional links and information

Special Characters

Introduction

Revision: 4, 4 March 2000

Editor comment: In response to the number of requests for help on managing special characters from an XML source via XSLT, I have collated some notes for users with the help of Mike Kay and David Carlisle on the XSL list, and John Cowan. They are the clearest definitions I have... until someone mails me a better one. <grin/>

If something is confusing, then perhaps it should be fixed. Please let me know.

Maintained by DaveP

Special Character handling using XSLT.

Is there a FAQ or a list of all the numbered characters and what the codes refer to?

The current definition of what a character is and what characters are allowed in an XML (and hence XSLT, XHTML, etc.) document can be found at: W3C

Specifically the characters come from the Universal Character Set defined by ISO/IEC 10646. Since the I in ISO stands for Ivory Tower, you can't actually look at the ISO/IEC 10646 standard without paying about US $400 for the hard copy and its relevant amendments. Lucky for us, though, the Unicode Standard parallels ISO/IEC 10646, and it is available at a more affordable price -- see Amazon.com

...plus its character charts for the U+0000 through U+FFFD range are available online for free, at the Unicode site in PDF and HTML formats.

Paul Caton noted: I find Alan Wood's Unicode pages particularly helpful. It's at his web site

Tony Graham adds: See "Where is my character?" for hints about where to look to find the Unicode code point for a character. He later gave me a link to the Unicode glossary, very useful.

I've added a small background section further down.

Selecting the character you want.

In order to select any non-ASCII (greater than decimal 127) character, look at the Unicode site and find the Unicode number corresponding to the character required.

What is Unicode? It's a character code standard (parallel to ISO 10646), originally designed as a 16-bit code, which is intended to provide character codes for, in principle, all the world's written languages.

For example: in the first block, the left angle bracket character (LESS-THAN SIGN) has the hexadecimal code 3C (&#x3C; as a character reference).

In another block, the Japanese yen sign (YEN SIGN) has the hexadecimal code A5 (&#xA5;).
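If you have the character and want its Unicode number (or vice versa), a quick sanity check is easy in Python, which works directly in code points:

```python
# Character -> code point, and code point -> character.
print(hex(ord("<")))   # 0x3c  -- LESS-THAN SIGN
print(hex(ord("¥")))   # 0xa5  -- YEN SIGN
print(chr(0xA5))       # ¥
```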

Caution!

Remember that some characters are 'special' to XML, hence if you don't want your document content to be interpreted as markup then you must treat the following characters in the same way as 'foreign' characters:

The Ampersand sign: &
The Left angle bracket sign: <

Enter the character into your XML source or XSL stylesheet (for literals). This can be either as an entity within either of the DTD's parts (internal or external) or as a character reference (decimal or hexadecimal) in context.

Example 1: within the XML DTD


<!DOCTYPE document  SYSTEM "document.dtd" [ 
<!ENTITY nbsp "&#160;"> 
]>

Now I can use this within my XML document by referring to it using the entity.


<para>Item:&nbsp; &nbsp;&nbsp;And its content </para>

Note that the # indicates that the value is decimal. With the addition of an x after the hash symbol, the value is interpreted as hexadecimal. At the URL given, all the character numbers are in hexadecimal, hence they will need both the hash sign and the x.
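The decimal and hexadecimal forms name exactly the same character. Python's html.unescape resolves numeric references using the same convention (the HTML rules for numeric references match XML's here), so it makes a convenient checker:

```python
from html import unescape

# &#160; (decimal) and &#xA0; (hexadecimal) are the same character,
# NO-BREAK SPACE, U+00A0.
assert unescape("&#160;") == unescape("&#xA0;") == "\u00a0"
print("both forms give U+%04X" % ord(unescape("&#160;")))
```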

Example 2: within the XSLT stylesheet


<!DOCTYPE xsl:stylesheet [
<!ENTITY nbsp "&#160;"> 
]>

Example 3: in context

  <xsl:text>&#xA0;  </xsl:text>

How to verify if 'it will display'

If the character you want won't display on the browsers you are targeting, it may be necessary to use an image. I'll leave that to the user, but please ensure that you use the alt attribute of the image, as in: <img src="pig.gif" alt="Textual description of the image"/>

If you're displaying words or sentences rather than individual characters it's best to combine them into a single GIF image. The simplest approach is to use a machine with the relevant fonts (and a person who can read them!), then type the text straight into your image editing tool, export as a gif image and refer to that in your output.

This may occur if you have to display text in languages such as Urdu, and with the wide variety of old browsers out there it's the only practical approach. On the other hand, if you're only displaying the stuff in locations where you have control over the client configuration, you can do much better.

Note: If you're a Japanese supplier publishing Japanese language information to Japanese consumers then you can be reasonably sure your target audience will have installed a browser that can handle Japanese text!

Generating XML from the keyboard.

Most modern systems in Western Europe and North America (with the exception of the Mac - I always forget that one) tend to use the ISO 8859/1 character set. This is not the case for the remainder of the world. The best solution is for the sending system to package the file as XML and specify what its encoding is. For exactitude, Windows PCs use Windows ANSI, which is a superset of Latin-1, adding extra characters in the range 128 to 159, most importantly the euro symbol.

Don't be tempted to use keyboard characters for other than the basic character set. It may display well on your system, but there is no guarantee that by the time it has reached someone else's browser it will be readable. On Windows, for example, don't use the ALT-0123 method of getting the character you want unless you let the XML parser know what encoding you are going to use, and the parser understands that encoding.

In XML if you specify

<?xml version="1.0" encoding="iso-8859-1"?>

then you are saying that the file uses the character encoding Latin-1, so the parser (if it understands that encoding) will translate all incoming characters to Unicode character data. So it is just as 'safe' to use Latin-1 as ASCII or UTF-8. Every XML file unambiguously declares what encoding it uses. XSLT, when it gets the input tree (and the parsed stylesheet) from the XML parser, does not see what was in the original file; it just sees the Unicode e-acute character, whether it was entered in Latin-1 or a Windows code page, or as &#233; or &eacute;.

If you want to use this method, just make sure that your XML document begins with

<?xml version="1.0" encoding="iso-8859-1"?>

or the XML parser will assume you have a UTF-8 document and will blow up on the accented characters.

However if the Unicode combining character is used and the input file has e' (where ' is really the combining acute character) then while any Unicode aware renderer is supposed to make this into an e acute for rendering, to an XML engine it is two characters, e and acute. (See later section on multi-byte characters.)
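The combined-versus-precomposed distinction is easy to demonstrate. The sketch below uses Python's unicodedata module: NFD decomposition splits a precomposed e-acute into the two-character base-plus-combining form, and NFC recombines it:

```python
import unicodedata

pre = "\u00e9"                              # precomposed e-acute: ONE character
dec = unicodedata.normalize("NFD", pre)     # e + COMBINING ACUTE ACCENT: TWO characters
assert len(pre) == 1 and len(dec) == 2

# A renderer draws both the same way, but to an XML engine they differ:
assert pre != dec
# NFC normalization recombines the pair into the precomposed form:
assert unicodedata.normalize("NFC", dec) == pre
```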

The important thing isn't how you enter the character, it's matching the encoding of the file that you enter it into with your entry method. This depends on the version of Windows you are using, and probably on the text editor as well. Simple text files in Windows, in Western Europe and North America, are typically in ISO 8859-1 code (Microsoft call it ANSI code but they're wrong), regardless of the keyboard you use. If you use a Mac, it's probably MacRoman, which no-one else understands. If in doubt, stick to ASCII characters, and use character entity references for anything else.

Within an XML file, specify the encoding you want when declaring the file to be XML, using the encoding attribute:

<?xml version="1.0" encoding="utf-8" ?>

This states that the document character set is UTF-8.

UTF-8 is essentially a way of compressing Unicode so on average it takes up much less than 16 bits per character. It also has the property that characters in the ASCII set are given their ASCII code value, so a file that consists predominantly of ASCII characters is easily readable, and can often be processed by software designed to handle ASCII. But accented letters do not look the same in UTF-8 and ISO 8859-1, so software designed for ISO 8859-1 (such as Windows Notepad) will not display a UTF-8 file containing such characters correctly. Conversely, characters entered from a system using 8859-1 will not be handled correctly within an XML world which uses UTF-8, unless the file declares its encoding.
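You can see both properties directly by encoding the same string two ways. The "Ã©" garbage below is exactly what a Latin-1 viewer shows when handed UTF-8 bytes:

```python
s = "café"
print(s.encode("utf-8"))       # b'caf\xc3\xa9'  -- é becomes two bytes
print(s.encode("iso-8859-1"))  # b'caf\xe9'      -- é is a single byte
# Viewing the UTF-8 bytes as if they were Latin-1 garbles the accent:
print(s.encode("utf-8").decode("iso-8859-1"))   # cafÃ©
```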

Unicode and UTF-8, a brief excursion.

If Tony Graham permits, I will add a glossary to this document, using some of his definitions. Permission sought, not yet received.

Unicode is a character repertoire, not a font encoding. If your keyboard is Latin1 or Latin2 or Japanese Shift-JIS you can type that character data straight into an XML file as long as the XML declaration specifies the right encoding. As long as the characters in the encoding you use are in the Unicode tables somewhere, it will be all right. (Although a given XML system isn't forced to understand any encodings other than UTF-8 and UTF-16.)

A character repertoire is the set of named 'things' by which I choose to name the characters I talk about. That is, a semantic interpretation which will differ across the world's languages. Each nation has its own 'character repertoire'; for example, A, Alpha and the Cyrillic A are three different Unicode characters, but normally the same glyph or visual appearance. Actually Unicode is more than just a character repertoire; it is an encoded character set (i.e. it does assign numbers), but it is not a character encoding in the sense of Latin-1 or Windows-ANSI.

Unicode is expressly (some would say religiously) not about glyph forms. So two things that are different but look the same, like A and Greek capital alpha, have different slots, but things that look different but mean the same (like any number of fancy ampersand characters) all have the same slot.

Context-sensitive reading of character data is what Unicode was built to avoid. A Russian user hits an A on his keyboard, and whatever local encoding he uses, if Unicode is used, it arrives on your machine with that character unambiguously marked as a Cyrillic letter. Whether or not your system can show you that is another matter!

Note: if I receive an XML file with Latin2-encoded data I can't look at that file with a normal command line tool, but any XML system will read it in, and the internal parsed tree will not have any indication of the original encoding. (Which is why files tend to go into XSL in Latin1 and come out UTF-8.)

There is a Unicode editor. It's still beta testing (4/00), but looks fairly complete. [Update: 2004. I now have a licence for the editor. It's very, very good.] Full character map, hence you can paste those oddities that you can't pronounce, never mind recognise! It can convert to XML character references in hex or decimal.

Upper case Unicode?

Tony Graham lets us know about upper casing developments in the Unicode world.

Uppercase, lowercase, titlecase, and a file of exceptions.

The "General Category" field of the UnicodeData.txt file [2] in the Unicode Character Database [3] for any version of Unicode categorises each character. The allowed values include values for "Letter, Uppercase", "Letter, Lowercase", and "Letter, Titlecase" (as well as "Letter, Modifier", "Letter, Other", and lots of non-letter category identifiers).

There's also a SpecialCasing.txt file in the Unicode Character Database for the many:1 and 1:many case mappings (such as the lowercase 'ß' mapping to the two uppercase characters 'SS') and locale-specific mappings (such as the lowercase mapping of 'I' to ı, LATIN SMALL LETTER DOTLESS I, in Turkish and Azerbaijani).
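The 1:many mapping is visible in any language whose case functions follow the Unicode data files. In Python, str.upper() applies the SpecialCasing rule for ß, but note that str methods are locale-independent, so the Turkish dotless-i rule needs a locale-aware library (ICU, for instance) rather than plain str.lower():

```python
# The SpecialCasing 1:many rule: one ß uppercases to the two characters SS.
assert "straße".upper() == "STRASSE"

# Default (locale-independent) mapping: this is NOT the Turkish result,
# which would lower-case I to ı (LATIN SMALL LETTER DOTLESS I).
assert "I".lower() == "i"
```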

Unicode Technical Report #21, Case Mappings, [1] provides more information.

What's this mean for XSL?

All of this is transparent to XML/XSLT processing, though, which works in terms of characters, which are not the same as bytes. (The first lesson of i18n: don't assume one character = one byte any more!)

However it affects XSL as follows. If you want to change one character into another you can do it with translate() as long as the character you want to translate to is a single Unicode character. If the accented letter does not exist as a pre-formed Unicode character and is encoded in Unicode as a base letter followed by a combining accent, then you cannot use translate() and have to use the slower named-template approach to search and replace substrings.
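Python's str.translate() has the same 1:1 character-for-character shape as XSLT's translate(), so it makes a convenient model of the limitation: mapping to a precomposed character works, but a base-plus-combining pair is two characters and cannot be the target of a one-character slot:

```python
import unicodedata

# e -> precomposed é: a 1:1 mapping, analogous to XSLT translate().
table = str.maketrans("e", "\u00e9")
assert "bebe".translate(table) == "bébé"

# The decomposed form (e + combining acute) is TWO characters, so a
# 1:1 mapping table cannot be built from it:
try:
    str.maketrans("e", unicodedata.normalize("NFD", "\u00e9"))
    raise AssertionError("should not get here")
except ValueError:
    pass  # the two maketrans arguments no longer have equal length
```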

XSLT output encoding.

<xsl:output method="text" encoding="utf-8"/>

The encoding attribute specifies the preferred character encoding that the XSLT processor should use to encode sequences of characters as sequences of bytes; note the use of 'preferred'. It is not a requirement to do any such conversion. utf-8 is a good choice if you are unsure.

The default XML encoding is UTF-8, and in UTF-8 position 160 is encoded as two bytes. If you look at the file on a terminal using Latin1 or ISO 8859-1 then you see the two bytes as two random characters, but a UTF-8 terminal or a browser which understands UTF-8 (which is most of the current versions of the main browsers) should do the right thing and show it as a non-breaking space. If you output &#160; you are outputting the single Unicode character 160 (even if it is two bytes in UTF-8).

If you output &amp;nbsp; then you are outputting the six characters `ampersand' n b s p ;
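The difference between the two outputs comes down to whether the ampersand is escaped. A sketch using Python's xml.sax.saxutils.escape (and html.unescape to resolve the numeric form):

```python
from xml.sax.saxutils import escape
from html import unescape

# To make the literal six characters &nbsp; survive in XML output,
# the ampersand itself must be escaped:
assert escape("&nbsp;") == "&amp;nbsp;"

# Whereas the numeric character reference names exactly one character:
assert unescape("&#160;") == "\u00a0"
assert len(unescape("&#160;")) == 1
```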

Note that you can't use character references (i.e. the &#... syntax) in XML names (elements or attributes), so if you want your XML to have French element names you have to use character data, not &#233;. Hence <element&#233;> is neither well-formed nor valid XML.

Multi-byte Multi-character Unicode Characters

In UTF-8, ASCII characters get one byte, the next 1920 Unicode characters (including the whole Latin-1 set) get two bytes, and the rest of the Basic Multilingual Plane gets three. Most texts in languages that use the extended Latin, Greek, Cyrillic, Hebrew, Armenian, and Arabic alphabets require at most two bytes; the other scripts (mostly from Asia) require three.
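The byte counts are easy to verify by encoding one character from each range:

```python
# One character from each UTF-8 length class:
for ch in ("A", "é", "Ξ", "語"):
    print(ch, len(ch.encode("utf-8")))
# A  1  (ASCII)
# é  2  (Latin-1 range)
# Ξ  2  (Greek, still below U+0800)
# 語 3  (CJK, rest of the Basic Multilingual Plane)
```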

These should be treated the same as the Latin-1 characters: you either use a charset like UTF-8 that can handle them, or you enter them as character references like &#x039E; Ξ (GREEK CAPITAL LETTER XI).

The idea here is that there are two ways to write characters like "é" (that's e with acute accent). You can use either a single "precomposed" character, the same way that Latin-1 does it, or you can use a "base" character, a plain "e", followed by the character COMBINING ACUTE ACCENT (&#x0301;) .

The base and combining characters are two separate characters for processing purposes, but are combined into a single glyph when displayed.

The Web standard is to use the precomposed form when possible. However, if you want "f with acute accent" you will find there is no such precomposed character in Unicode, so you need a plain "f" followed by &#x0301;.

Browser Characteristics

Netscape 4.0 has a bug: it will not understand references to XML entities unless the charset is set to UTF-8!

Note that XML (including XSLT) and HTML 4.0 have exactly the same rules about these things. The only differences are that most HTML browsers don't support the hexadecimal form &#x039E; so you must write &#926; instead, and that HTML defines a bunch of names for useful characters, whereas XML makes you define them yourself, or else import them from: Latin 1 Specials Symbols

Here's a canned recipe for importing them all, to be placed at the top of an XSLT script:



<!DOCTYPE xsl:stylesheet  [
  <!ENTITY % xhtml-lat1 SYSTEM
     "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent">
  <!ENTITY % xhtml-special SYSTEM
     "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
  <!ENTITY % xhtml-symbol SYSTEM
     "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
  %xhtml-lat1;
  %xhtml-special;
  %xhtml-symbol;
  ]>

Alas, not all XSLT implementations are guaranteed to process these "external parameter entities" (depending on the underlying XML parser used in the XSLT software), in which case uses of those entities will generate low-level parser errors.

Additional information

What do I need to add to the dtd to use Unicode?

In XML: nothing, since XML already uses Unicode as its character set.

In SGML you need to use an SGML declaration that sets up the document character set to be Unicode. The easiest way to do that is simply to borrow HTML 4.01's SGML declaration: declaration

How do I represent special characters in the mark-up in Unicode?

You have several different alternatives:

use some Unicode encoding to store your documents in and simply write the characters straight in (whatever that means in your editor) as if they were normal characters (which they are, in that context)
use character references, such as &#12927;, to refer directly to the character number of the character you want (note that when talking about Unicode one almost exclusively writes character numbers in hexadecimal, whereas &#????; requires a decimal number; use the &#x????; form for hexadecimal)
define mnemonic entity references for your characters, and let them resolve to character references. Many well-known sets of such entity declarations already exist and can be reused. See HTML 4.01 and also Robin Cover's site.

Does Unicode replace using ISO character sets?

Which ones? ISO has defined lots of character sets, and ISO 10646 is a character set that is identical to Unicode. If you mean the ISO 8859 character sets, then I think it probably will with time, but for the moment most people (outside Asia) simply use ISO 8859-x and are happy with that.



> I know I can type the yen sign from my win95 keyboard
> with ¥  (it shows here :-) (ALT 0165 ). If I do this into
> an XML file as above, specifying 8859-1, guessing that
> this particular character is in 8859-1 am I doing
> the same as using &#xxx; or whatever the unicode
> number is for the same 'character'?

Yes, there is absolutely no difference, provided the XML file is tagged as 8859-1. The English-language Windows character set, CP-1252, is a superset of Latin-1, and as long as you don't use certain characters (the ones typed with ALT 0128 through ALT 0159) all is well.

BTW, since one types YEN SIGN with ALT 0165, that means the equivalent character reference is &#165; (no x). Convenient.
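That correspondence between the ALT code and the decimal character reference can be checked directly:

```python
from html import unescape

# ALT 0165 types code 165, so the decimal character reference matches:
assert chr(165) == "¥"
assert unescape("&#165;") == "¥"
assert unescape("&#xA5;") == "¥"   # the same character in hexadecimal
```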

> For Europeans used to simply typing a non ASCII
> character, I feel this is a potential source of trouble.

On the contrary: it is a source of strength. You can routinely use such characters, even in names of elements where &#... is forbidden. So if you are German and want to have an element named "Körper" (meaning "body") in your XML source, you can. (YEN SIGN won't work because it is not a letter or digit.) Of course, all the xsl: elements have fixed names.

Unicode, some background

ASCII was originally defined as a 7-bit code which used the extra bit in each byte for error control. Therefore there were 2^7 = 128 different codes available. This was fine for the English-speaking world (and I am sure that other languages had a local equivalent of ASCII). To accommodate other languages with relatively short alphabets, a kludge was made via the different forms of encoding: any byte which began with a 0 was deemed to be US ASCII, but if the first bit was a 1 then the document should signal what all characters in the range 128-255 meant by declaring an ISO encoding.

To avoid this messiness Unicode was developed. In its most basic form it was meant to be an encoding giving two bytes to each character, and therefore it could potentially support 65,536 different letters.

As US ASCII was the most common encoding used in the world, it was given the same character numbers of 0 to 127, so letters in the English alphabet are encoded as 00000000 0xxxxxxx. The problem with this was finding a means of keeping backwards compatibility. If I open a text file encoded in two-byte Unicode in Notepad on Windows 95, I see "h e l l o  w o r l d" (each letter is preceded by a null byte, which displays as a gap). Even worse, Unix systems use 00000000 as a special character to indicate the end of strings of characters.

Thus another kludge was devised to accommodate this problem, whereby a variable number of bytes could be used to store a character. If a byte begins with a 0, it is taken to be a single byte referring to a single character in the original US ASCII character set of 0 to 127. The two remaining problems needing to be sorted are how to signal to the computer the number of bytes used to encode a character, and how to show where a character starts (remember that now there is no correspondence between bytes and characters).

To solve the first it was decided that the number of leading 1's indicated how many bytes were used to encode the character, thus 110 indicates that 2 bytes are used, 1110 3 bytes, 11110 4 bytes etc.

To solve the second it was decided that the first two bits in any continuation byte that was part of the encoding of a single character should be 10. This gives the following patterns:

One byte      0xxxxxxx                        7 bits available for encoding
Two bytes     110xxxxx 10xxxxxx              11 bits available for encoding
Three bytes   1110xxxx 10xxxxxx 10xxxxxx     16 bits available for encoding

This is what UTF-8 is. The only problem for its adoption is that most Western languages need 2 bytes to encode their accented letters, whereas they can get by with one byte using other encodings, and Asian languages need 3 where straight Unicode (or other local encodings) could get away with 2.
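The bit patterns above can be inspected on a real character. Printing each byte of a two-byte sequence in binary shows the 110 lead byte and the 10 continuation byte:

```python
# YEN SIGN (U+00A5) needs a two-byte UTF-8 sequence:
for byte in "¥".encode("utf-8"):
    print(format(byte, "08b"))
# 11000010  -- '110' prefix: lead byte of a 2-byte sequence
# 10100101  -- '10'  prefix: continuation byte
```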

If you look on the IBM alphaWorks website you can find a program which will convert files from one encoding to another.

Conclusion

By following these guidelines, you will have a better chance of producing browser output that matches your intention. It's not unreasonable for these ideas to be usable for print output, always assuming that your output system understands Unicode.

References

Unicode, A primer, by Tony Graham. 2000, Wrox Press. Just what it says! See Amazon.com
A Unicode Editor
Some introductory material on characters and character encodings. This model provides authors of specifications, software developers, and content developers a common reference for inter-operable text manipulations on the World Wide Web. Topics addressed include encoding identification, early uniform normalization, string identity matching, string indexing, and URI conventions, building on the Universal Character Set (UCS).
A site I found useful in researching this stuff:
(UTF-8 and Unicode FAQ for Unix/Linux)
RFC2279 (UTF-8, a transformation format of ISO 10646)
A good explanatory site