xml and utf-8, encoding

UTF-8

1. Does HTML know UTF-8
2. Byte order Mark
3. Displaying Unicode characters.
4. Byte order Mark

1.

Does HTML know UTF-8

Mike Brown

Like XML, HTML 4.0 is mostly defined in terms of UCS/Unicode characters, which of course must be encoded. There is a mechanism for a document to signal its own character encoding via a META declaration. This can be overridden by a charset parameter in an HTTP Content-Type header.

Since HTML doesn't prescribe UTF-8 as a default and because the META declaration can appear pretty far down in the document HEAD, the recommendation states that only ASCII (U+0000 through U+007F) characters should be used in the document up to that point.
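The precedence described above (HTTP charset parameter first, then the META declaration) can be sketched roughly as follows. This is only an illustration, not how any particular browser works: the helper name `pick_encoding`, the 1024-byte scan window and the `iso-8859-1` fallback are all assumptions made for the example.

```python
import re

# Hypothetical pattern: a rough scan for charset=... inside a META tag,
# not a real HTML parser.
META_CHARSET = re.compile(
    rb'''<meta[^>]+charset\s*=\s*["']?([A-Za-z0-9._-]+)''', re.IGNORECASE
)

def pick_encoding(http_content_type, body, default="iso-8859-1"):
    """Return the encoding name to use for an HTML document (sketch)."""
    # 1. A charset parameter in the HTTP Content-Type header wins.
    if http_content_type:
        m = re.search(r'charset\s*=\s*"?([A-Za-z0-9._-]+)',
                      http_content_type, re.IGNORECASE)
        if m:
            return m.group(1).lower()
    # 2. Otherwise look for a META declaration near the top of the document.
    #    Everything before it is expected to be plain ASCII.
    m = META_CHARSET.search(body[:1024])
    if m:
        return m.group(1).decode("ascii").lower()
    # 3. Fall back to an assumed default (HTML itself prescribes none).
    return default

html = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html; charset=utf-8"></head></html>')
print(pick_encoding(None, html))
print(pick_encoding("text/html; charset=ISO-8859-1", html))
```

The ASCII-only rule matters here: the scan in step 2 reads raw bytes before the encoding is known, which is only safe if those bytes mean the same thing in every candidate encoding.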

This stuff is discussed at W3C

It is worth pointing out that the value of the recommendation is only as good as the user agents' support for it. The 4.0 browsers seem to do okay with automatically selecting the proper encoding when interpreting a document, but you may have noticed that they also let the user manually choose it even if the document signaled its own encoding.

2.

Byte order Mark

Tony Graham

It's always been possible to use U+FEFF with UTF-8 in XML. It just wasn't mentioned in the XML Recommendation (and still isn't all that explicit).

ISO/IEC 10646-1:1993 has "always" supported use of ZERO WIDTH NO-BREAK SPACE (U+FEFF) as an encoding signature for UTF-8 (where "always" probably means "since UTF-8 was added to ISO/IEC 10646-1:1993 as Amendment 2 some time before there was a Unicode 2.0").

The Unicode side of the Unicode==ISO/IEC 10646 equation was ambivalent (at best) about U+FEFF as an encoding signature for UTF-8 for quite a long time after ISO/IEC 10646 blessed the idea, but the signature is now listed as such in Section 13.6, Specials, of the Unicode Standard, Version 3.0.
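For anyone who wants to see the signature as bytes, here is a small Python sketch. It only uses the standard `codecs` module; the `<doc/>` sample document is made up for the example.

```python
import codecs

bom = "\ufeff"                      # ZERO WIDTH NO-BREAK SPACE
signature = bom.encode("utf-8")     # the three-byte UTF-8 signature
print(signature.hex(" "))           # ef bb bf

# A made-up document that starts with the signature.
data = codecs.BOM_UTF8 + "<doc/>".encode("utf-8")

plain = data.decode("utf-8")          # keeps U+FEFF as the first character
stripped = data.decode("utf-8-sig")   # drops the signature if it is present
print(repr(plain[0]), stripped)
```

The plain `utf-8` codec treats the signature as an ordinary character, which is why software that does not expect it can trip over it; `utf-8-sig` is Python's way of accepting it silently.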

3.

Displaying Unicode characters.

Thomas B. Passin


> It would depend on the User Agent, not the platform. If this is actually
> true for any "recent" version of IE (let's say, since 4.0), I'd like to see
> some evidence before I believe it :-)
 

I just did an experiment that verified what each of us said. I created an XML file on my Windows 2000 machine with a ® in it. I transformed it with an identity transform twice, first with encoding='utf-8' and second with encoding='iso-8859-1'.

Looking at the hex bytes, the iso results contained a hex AE byte, which is correct for character 174. The utf-8 results contained the two hex bytes C2 AE, which I presume is right for utf-8. Both results displayed the registered trademark symbol, the one with the r in a circle.
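The byte values are easy to double-check without a hex editor. A minimal Python sketch (not part of the original experiment) that encodes character 174 both ways:

```python
ch = chr(174)                        # the registered trademark sign, U+00AE

latin1 = ch.encode("iso-8859-1")     # one byte
utf8 = ch.encode("utf-8")            # two bytes

print(latin1.hex(), utf8.hex(" "))   # ae c2 ae
```

So C2 AE is indeed the UTF-8 form of the character, and a single AE byte is the ISO-8859-1 form.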

I copied the results to a floppy and took it over to my Win95/SP2 computer, then displayed the results in IE 5.5. Both files displayed the same, showing the right symbol. This is what you said would happen.

I also loaded each result into Notepad on Win95. Notepad displayed the iso file correctly, but not the utf-8 result: it showed an "A" character with a little circle above it ahead of the trademark symbol. This is what I was suggesting would happen. BTW, Notepad on the Win2000 computer did display both results correctly.
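The Notepad result can be reproduced in a few lines of Python, on the assumption (mine, not something verified on Win95) that the old Notepad simply read the file as a single-byte Windows code page such as windows-1252:

```python
utf8_bytes = "\u00ae".encode("utf-8")   # b'\xc2\xae'

# Reading the UTF-8 bytes as a single-byte code page turns each byte
# into its own character: an accented capital A, then the symbol.
wrong = utf8_bytes.decode("cp1252")
right = utf8_bytes.decode("utf-8")

print(len(wrong), len(right))           # 2 1
```

The first byte, C2, becomes an accented capital A of its own, and the second byte, AE, happens to be the trademark symbol in that code page, which is why the symbol still appears but with an extra character ahead of it.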

Summarizing, what you will see displayed for high-order characters can depend on the encoding, OS, and the viewing program. On older versions of Windows, at least, non-browsers are likely to display the wrong thing.

In fact, even on my Win2000 machine, using XML Cooktop to run and display the transformation gave an incorrect display (and it uses the IE ActiveX control to display the results!), so you can't be sure even on Win2000 that high-order characters will display the intended way, depending on the app.

Try it yourself on your system. Here are the files:

 <?xml version='1.0' encoding='utf-8'?>
 <data>Here is a ==&#174;== character</data>
 
 <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output encoding='utf-8'/><!--Or change to iso-8859-1-->
 
 <!-- Identity transformation template -->
 <xsl:template match='*|@*'>
  <xsl:copy>
   <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
 </xsl:template>
 
 </xsl:stylesheet>

4.

Byte order Mark

Michael Beddow



> Since the lower end of Unicode is congruent with ASCII etc.,
> editors sometimes fail to preserve it. The BOM that would serve as an
> unambiguous "signature" for UTF-8 is optional, apparently. Not only that,
> but since some browsers (e.g. NN 4~) choke on the BOM in UTF-8, some
> editors leave it off, relying on heuristics (that can fail) to determine
> whether something is supposed to be UTF-8, ISO-8859 etc.

I hate to defend NN4 under any circumstances, but even such a justly revered parser as expat (at least in the form James Clark handed it on) throws an error if a utf-8 encoded document begins with a BOM.

[For the perplexed: documents encoded in representations of Unicode that use multi-byte (16- or 32-bit) values need to start with an indicator that tells the processor whether the high or low-order byte comes first: like the Lilliputians, computer manufacturers are split into big- and little-endian factions. This indicator was first called a "signature" in ISO usage and a Byte Order Mark (BOM) in Unicode, though the "signature" term has now migrated into Unicode documentation. Since utf-8 represents code-points as sequences of discrete 8-bit values, the "endian-ness" problem doesn't arise.]
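To make the endian-ness point concrete, here is a Python sketch that encodes U+FEFF in both 16-bit byte orders and then uses the resulting signatures to guess an encoding. The function name `sniff_signature` is hypothetical, and a real processor would need more than this (for instance a fallback when no signature is present).

```python
import codecs

# The same character, two byte orders.
print("\ufeff".encode("utf-16-be").hex(" "))   # fe ff
print("\ufeff".encode("utf-16-le").hex(" "))   # ff fe

def sniff_signature(data):
    """Return an encoding name based on a leading signature, or None."""
    # Check the 32-bit signatures first: the UTF-32 little-endian BOM
    # begins with the same two bytes as the UTF-16 little-endian BOM.
    candidates = (
        ("utf-32-be", codecs.BOM_UTF32_BE),
        ("utf-32-le", codecs.BOM_UTF32_LE),
        ("utf-8", codecs.BOM_UTF8),
        ("utf-16-be", codecs.BOM_UTF16_BE),
        ("utf-16-le", codecs.BOM_UTF16_LE),
    )
    for name, sig in candidates:
        if data.startswith(sig):
            return name
    return None

print(sniff_signature(b"\xff\xfe<\x00a\x00"))   # utf-16-le
print(sniff_signature(b"\xef\xbb\xbf<a/>"))     # utf-8
```

In utf-8 the three signature bytes always come in the same order, so there the mark can only serve as a signature, never as an indication of byte order.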

What is sometimes called the "first edition" of the XML spec (REC-xml-19980210), according to which the first parsers were built, makes no mention of a BOM with utf-8. It looks as though the authors at this stage took the line that a byte order marker made no sense in an encoding where endian-ness was not an issue, and only later realised its utility as a "signature" to signal utf-8, even where its original ISO/Unicode function of declaring byte order was not required. The so-called "second edition" (REC-xml-20001006) does provide (in revised Appendix F) for the use of a BOM in utf-8. Hence the unclarity about what a conformant parser should do if a utf-8 document begins with a BOM. It didn't help matters that Microsoft in their XML implementations decided the utf-8 "BOM" should be there, whereupon vocal MS-phobes worldwide immediately began declaring that it most certainly should not.
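Given that uncertainty, one defensive approach is to remove a leading utf-8 signature before the bytes ever reach the parser. The Python sketch below shows the idea; `parse_tolerating_bom` is a hypothetical helper, and the sample document is the test file from earlier on this page with a signature put in front of it.

```python
import codecs
import xml.etree.ElementTree as ET

def parse_tolerating_bom(data):
    """Parse XML bytes, dropping a leading utf-8 signature if there is one."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return ET.fromstring(data)

doc = (codecs.BOM_UTF8 +
       b"<?xml version='1.0' encoding='utf-8'?>\n"
       b"<data>Here is a ==&#174;== character</data>")

root = parse_tolerating_bom(doc)
print(root.text == "Here is a ==\u00ae== character")   # True
```

Stripping the three bytes is harmless for a parser that would have accepted them anyway, and it sidesteps the question of which edition of the spec the parser was written against.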