xslt and whitespace, normalize-space

Normalise-space

1. normalize-space, what does it do?
2. How to normalise strings containing subelements
3. Normalising space in mixed content.
4. Separating in-line elements
5. Element name normalisation
6. Normalise space

1.

normalize-space, what does it do?

Jeni Tennison



> Using normalize-space is the only way I've found to eliminate the
> newlines, but then the whitespace around the command elements
> is stripped.

That's because normalize-space() does three things:


  - replaces whitespace characters with spaces
  - strips leading and trailing whitespace
  - collapses sequences of whitespace into single spaces

It's the second of these that's causing you to lose whitespace. You only want to do the first to get rid of the newline characters, so instead of using normalize-space(), use the translate() function to replace any &#xA; (line feed) or &#xD; (carriage return) characters with spaces:

<xsl:template match="text()">
  <xsl:value-of select="translate(., '&#xA;&#xD;', '  ')" />
</xsl:template>
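
To see the difference side by side, here is a minimal sketch that applies both functions to the same string. The $sample variable and its value are invented for illustration and are not taken from the original question.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical sample: one leading space, a newline in the middle,
       two trailing spaces -->
  <xsl:variable name="sample" select="' run&#xA;the  '"/>

  <xsl:template match="/">
    <!-- normalize-space(): strips both ends and collapses inner runs -->
    <xsl:text>[</xsl:text>
    <xsl:value-of select="normalize-space($sample)"/>
    <xsl:text>]&#xA;</xsl:text>

    <!-- translate(): only swaps newline characters for spaces,
         leading and trailing whitespace is left alone -->
    <xsl:text>[</xsl:text>
    <xsl:value-of select="translate($sample, '&#xA;&#xD;', '  ')"/>
    <xsl:text>]&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Run against any input document, this should print [run the] on the first line and [ run the  ] on the second: the translate() version keeps the spaces at both ends.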

> The preserve-space (and strip-space) functions don't seem to have
> any effect, presumably because refpurpose is not a text-only
> element.

The xsl:preserve-space and xsl:strip-space elements govern how whitespace-only text nodes are handled -- whether they're ignored or preserved. In your document, the text nodes you're worried about have other characters in them as well as whitespace, so they won't be affected. You'd still be well advised not to use xsl:strip-space on refpurpose: if you ever did have a whitespace-only text node within it, the likelihood is that you'd want to keep it.
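
As an illustration of what these two elements do affect, here is a minimal sketch. The element lists are assumptions: it strips whitespace-only text nodes everywhere except directly inside refpurpose. The option element in the comment is hypothetical.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Drop whitespace-only text nodes in every element ... -->
  <xsl:strip-space elements="*"/>
  <!-- ... except those that are children of refpurpose -->
  <xsl:preserve-space elements="refpurpose"/>

  <!-- Identity transform: copy everything that survives stripping -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Given (hypothetical) input such as
         <refpurpose><command>ls</command> <option>-l</option></refpurpose>
       the single space between the two child elements is a
       whitespace-only text node. It is kept here only because of the
       xsl:preserve-space declaration above. Text nodes that contain
       other characters are never touched by either declaration. -->

</xsl:stylesheet>
```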

2.

How to normalise strings containing subelements

David Carlisle

Note: Entities are no problem, as they are all expanded before XSL sees the document.

Normalising space in mixed content is _always_ a problem. Of course, normally you just don't do it: the renderer (e.g. HTML in this case) can do it more easily as it lays out the characters, after all the markup is gone.

If you do want to do it within the markup, the problem as stated was underspecified, but here is a possible (probably partial) solution. As always documented to the "literate programming" standards discussed in an earlier thread on this list.

<x>
<para>Some    text    <em>some    other   text</em>   remaining text</para>
<para>Some    text<em>    some    other   text</em>   remaining text</para>
<para>Some    text    <em>   some    other   text</em>   remaining
text</para>
<para>Some    text    <em>some    other   text </em>   remaining text</para>
<para>Some    text    <em>some    other   text </em>remaining text</para>
<para> Some    text    <em>some    other   text</em>   remaining text
</para>
</x>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="*">
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

<xsl:template match="para/text()">
  <xsl:value-of select="normalize-space(.)"/>
  <xsl:if test="contains(concat(.,'^$%'),' ^$%') and following-sibling::* and
                not(following-sibling::*[1]/node()[1][self::text() and starts-with(.,' ')])">
    <xsl:text> </xsl:text>
  </xsl:if>
</xsl:template>

<xsl:template match="para/*/text()">
  <xsl:if test="starts-with(.,' ')"><xsl:text> </xsl:text></xsl:if>
  <xsl:value-of select="normalize-space(.)"/>
  <xsl:if test="contains(concat(.,'^$%'),' ^$%') or
                ../following-sibling::node()[1][self::text() and starts-with(.,' ')]">
    <xsl:text> </xsl:text>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>

bash-2.01$ xt normsp.xml normsp.xsl
<?xml version="1.0" encoding="utf-8"?>
<x>
<para>Some text <em>some other text </em>remaining text</para>
<para>Some text<em> some other text </em>remaining text</para>
<para>Some text<em> some other text </em>remaining text</para>
<para>Some text <em>some other text </em>remaining text</para>
<para>Some text <em>some other text </em>remaining text</para>
<para>Some text <em>some other text </em>remaining text</para>
</x>
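
The test contains(concat(., '^$%'), ' ^$%') in the stylesheet above stands in for an "ends with a space" check, which XPath 1.0 does not provide directly: an arbitrary marker string is appended, and the test looks for a space immediately before it. A sketch of an alternative check, using substring() to pick out the last character, might look like this. The reporting format is invented for illustration; unlike the marker trick, this version treats any trailing whitespace character (space, tab, newline) as a match.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="text()">
    <!-- Last character of the text node ('' if the node is empty) -->
    <xsl:variable name="last" select="substring(., string-length(.))"/>

    <xsl:text>[</xsl:text>
    <xsl:value-of select="normalize-space(.)"/>
    <xsl:text>] ends with whitespace: </xsl:text>
    <!-- A whitespace character normalizes to the empty string, so the
         comparison is true only when $last is whitespace -->
    <xsl:value-of select="$last != normalize-space($last)"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```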

3.

Normalising space in mixed content.

David Carlisle and Wendell Piez

Why not strip it all, then put it back where you want it? So....

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="text()">
  <!-- strip extra whitespace from text nodes
       (including leading and trailing whitespace) -->
  <xsl:value-of select="normalize-space(.)"/>
</xsl:template>

<xsl:template match="*">
  <!-- default element rule is identity transform -->
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

<xsl:template match="*[../text()[normalize-space(.) != '']]">
  <!-- but this template matches any element appearing in mixed content -->
  <xsl:variable name="textbefore"
       select="preceding-sibling::node()[1][self::text()]"/>
  <xsl:variable name="textafter"
       select="following-sibling::node()[1][self::text()]"/>
  <!-- Either of the preceding variables will be an empty node set 
       if the neighbor node is not text(), right? -->
  <xsl:variable name="prevchar"
       select="substring($textbefore, string-length($textbefore))"/>
  <xsl:variable name="nextchar"
       select="substring($textafter, 1, 1)"/>

  <!-- Now the action: -->
  <xsl:if test="$prevchar != normalize-space($prevchar)">
  <!-- If the original text had a space before, add one back -->
    <xsl:text> </xsl:text>
  </xsl:if>

  <xsl:copy>
  <!-- Copy the element over -->
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>

  <xsl:if test="$nextchar != normalize-space($nextchar)">
  <!-- If the original text had a space after, add one back -->
    <xsl:text> </xsl:text>
  </xsl:if>

</xsl:template>

</xsl:stylesheet>

Using David's test:

<x>
<para>Some    text    <em>some    other   text</em>   remaining text</para>
<para>Some    text<em>    some    other   text</em>   remaining text</para>
<para>Some    text    <em>   some    other   text</em>   remaining
text</para>
<para>Some    text    <em>some    other   text </em>   remaining text</para>
<para>Some    text    <em>some    other   text </em>remaining text</para>
<para> Some    text    <em>some    other   text</em>   remaining text
</para>
</x>

We get output (using Saxon):

<x>
   <para>Some text <em>some other text</em> remaining text</para>
   <para>Some text<em>some other text</em> remaining text</para>
   <para>Some text <em>some other text</em> remaining text</para>
   <para>Some text <em>some other text</em> remaining text</para>
   <para>Some text <em>some other text</em>remaining text</para>
   <para>Some text <em>some other text</em> remaining text</para>
</x>

If you wanted to get space before the <em> element in the second case or after in the fifth case, the logic could be extended to catch them (left as an exercise :-).
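
One possible way to approach that exercise is sketched below. It is not from the original posters. In addition to the neighbouring text, it looks at the first and last text nodes inside the element, and emits a space outside the element if either side had one. The variable names innerfirst, innerlast, firstchar and lastchar are introduced here for illustration.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml"/>

  <xsl:template match="text()">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>

  <xsl:template match="*">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="*[../text()[normalize-space(.) != '']]">
    <xsl:variable name="textbefore"
         select="preceding-sibling::node()[1][self::text()]"/>
    <xsl:variable name="textafter"
         select="following-sibling::node()[1][self::text()]"/>
    <!-- Assumed extension: first and last text nodes inside the element -->
    <xsl:variable name="innerfirst" select="node()[1][self::text()]"/>
    <xsl:variable name="innerlast" select="node()[last()][self::text()]"/>

    <xsl:variable name="prevchar"
         select="substring($textbefore, string-length($textbefore))"/>
    <xsl:variable name="nextchar" select="substring($textafter, 1, 1)"/>
    <xsl:variable name="firstchar" select="substring($innerfirst, 1, 1)"/>
    <xsl:variable name="lastchar"
         select="substring($innerlast, string-length($innerlast))"/>

    <!-- Space before: either the preceding text ended with whitespace
         or the element's own content began with it -->
    <xsl:if test="$prevchar != normalize-space($prevchar)
                  or $firstchar != normalize-space($firstchar)">
      <xsl:text> </xsl:text>
    </xsl:if>

    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates/>
    </xsl:copy>

    <!-- Space after: either the following text began with whitespace
         or the element's own content ended with it -->
    <xsl:if test="$nextchar != normalize-space($nextchar)
                  or $lastchar != normalize-space($lastchar)">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

With David's test input, the second and fifth paragraphs should then come out with a single space before and after the em element respectively. One caveat: an element that is the very first or last node of its parent and has inner leading or trailing whitespace will now produce a space at the edge of the parent, which may or may not be what you want.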

4.

Separating in-line elements

Jeni Tennison


I'm trying to convert something like
<P>Text1<H1>Head1</H1>Text2<H1>Head2</H1>Text3</P>

to 

<P>Text1</P>
<H1>Head1</H1>
<P>Text2</P>
<H1>Head2</H1>
<P>Text3</P>

	

I'll talk through your solution, which will hopefully show you why it's going wrong, and then give you an alternative approach.

>I tried something like 
><xsl:template match="P">
> <xsl:for-each match=".">
>  <xsl:chose>
>   <xsl:when test="H1">
>    <H1><xsl:apply-templates></H1>
>   </xsl:when>
>   <xsl:otherwise>
>    <P><xsl:apply-templates></P>
>   </xsl:otherwise>
>  <xsl:chose>
> </xsl:for-each>
></xsl:template>

There are four errors that your XSLT processor should have picked up on:

1. the 'match' attribute on the xsl:for-each should be a 'select' attribute
2. the 'xsl:chose' element should be called 'xsl:choose'
3. the xsl:apply-templates elements should be closed
4. the xsl:choose end tag is missing a '/'

After making those changes, what this template says is:

When you find a 'P' element, then for each current node (that is, the P element itself), if it contains an H1 element, create an H1 element whose content is the result of applying templates to the children of this element; if not, create a P element whose content is the result of applying templates to the children of this element.

There are a few things wrong with that. Firstly, you probably want to iterate over the *contents* of the P element, not the P element itself, in other words:

  <xsl:for-each select="node()">
    ...
  </xsl:for-each>

Secondly, within this xsl:for-each, you want to test whether the current node (which is either a text node or an H1 element) *is* an H1 element. The test="H1" tests whether the current node has an H1 element *as a child*. You should use:

  <xsl:when test="self::H1">
    ...
  </xsl:when>

instead.

Finally, within the xsl:for-each, if the node is a text node (e.g. 'Text1'), then rather than applying templates to its children (it doesn't have any), you want to get its value:

  <xsl:otherwise>
    <xsl:value-of select="." />
  </xsl:otherwise>

If you make these changes, you get:

<xsl:template match="P">
 <xsl:for-each select="node()">
  <xsl:choose>
   <xsl:when test="self::H1">
    <H1><xsl:apply-templates /></H1>
   </xsl:when>
   <xsl:otherwise>
    <P><xsl:value-of select="." /></P>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:for-each>
</xsl:template>

which creates the desired output in Saxon and Xalan.

However, iteration is generally not the best approach when you're faced with mixed content. Rather than iterating over the contents of the P element, you can let the processor fall back on the built-in rules that apply templates to its content automatically. Then you can write rules (templates) that tell the processor what to do when it comes across an H1, a text() node, or anything else within a P element.

So, you can use the following template to say 'whenever there's a text() node within a P element, create a P element to hold its content':

<xsl:template match="P/text()">
  <P><xsl:value-of select="." /></P>
</xsl:template>

And the following template to say 'whenever there's an H1 element within a P element, create an H1 element with the content being the result of applying templates to its content':

<xsl:template match="P/H1">
  <H1><xsl:apply-templates /></H1>
</xsl:template>
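
Put together, the two templates might sit in a stylesheet like the following sketch. The doc wrapper element is an assumption added here so that the result has a single root element; it is not part of the original question.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical wrapper so the output is a well-formed document -->
  <xsl:template match="/">
    <doc>
      <xsl:apply-templates/>
    </doc>
  </xsl:template>

  <!-- There is deliberately no template for P itself: the built-in
       rule for elements applies templates to its children, so the
       original P wrapper disappears from the output -->

  <!-- Each run of text directly inside P becomes its own P -->
  <xsl:template match="P/text()">
    <P><xsl:value-of select="."/></P>
  </xsl:template>

  <!-- Each H1 inside P is recreated at the same level as the new Ps -->
  <xsl:template match="P/H1">
    <H1><xsl:apply-templates/></H1>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the sample <P>Text1<H1>Head1</H1>Text2<H1>Head2</H1>Text3</P>, this should give the five sibling elements asked for, inside the doc wrapper.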

5.

Element name normalisation

Wendell Piez

Sometimes SGML tools will do things that mess with tag names, such as folding them uniformly into upper or lower case. This is a pain because they must be switched back into their original forms (their DTD-normalized forms) to be validated as XML, if you want to move the file back into an XML environment. If their proper forms happen to include mixed-case names, you're in a hole. The perfect solution would be a DTD-aware process. A less perfect approach is to provide the correct names in an alternate spec of some sort. Since unextended XSLT has no access to the DTD's model, this stylesheet takes the latter approach.

This stylesheet is wired to run with a configuration file 'mlangnames.xml' as its listing of names (element names, attribute names and names of enumerated attribute values); but you can override that either by changing the stylesheet or by setting the 'names' parameter on the command line, thus normalizing to the tag set of your choice.

So, for example, a routine invoking the Saxon processor could run:

Saxon -o fixed.xml messedup.xml xmlnames.xsl names=mynames.xml

See mlangnames.xml for the format of the names lists; the stylesheet is the xmlnames.xsl used in the command above.
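
Neither file is reproduced here, so the following is only a rough sketch of the general idea and not Wendell's stylesheet. The names-file format shown in the comment is invented for illustration (the real mlangnames.xml may look quite different), and the sketch handles element names only, ignoring attributes, enumerated values and namespaces.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical names file format, assumed for this sketch only:
         <names>
           <element>refPurpose</element>
           <element>refEntry</element>
         </names>
  -->
  <xsl:param name="names" select="'mlangnames.xml'"/>

  <xsl:variable name="lc" select="'abcdefghijklmnopqrstuvwxyz'"/>
  <xsl:variable name="uc" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
  <xsl:variable name="elements" select="document($names)/names/element"/>

  <xsl:template match="*">
    <!-- Case-folded form of the current element's name -->
    <xsl:variable name="folded" select="translate(local-name(), $uc, $lc)"/>
    <!-- First listed name that folds to the same string, if any -->
    <xsl:variable name="proper"
         select="$elements[translate(normalize-space(.), $uc, $lc) = $folded][1]"/>
    <xsl:choose>
      <xsl:when test="$proper">
        <!-- Recreate the element under its listed spelling -->
        <xsl:element name="{normalize-space($proper)}">
          <xsl:copy-of select="@*"/>
          <xsl:apply-templates/>
        </xsl:element>
      </xsl:when>
      <xsl:otherwise>
        <!-- Unknown name: copy through unchanged -->
        <xsl:copy>
          <xsl:copy-of select="@*"/>
          <xsl:apply-templates/>
        </xsl:copy>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

The translate() calls do the case folding, since XSLT 1.0 has no lower-case function; normalize-space() guards against stray whitespace around the names in the list.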

6.

Normalise space

Jeni Tennison



>> For the first state (Alabama), "admissions/state/text()"
>> evaluates to something like:
>> 
>> Alabama<cr><space><space><cr><space><space>
>> 
>> Which is NOT the same as:
>> 
>> Alabama
>> 

normalize-space() strips leading and trailing space, so if the string was:

  "Alabama<cr><space><space><cr><space><space>"

then all that trailing space would be stripped and you'd get:

  "Alabama"

It's only whitespace in the *middle* of the string that gets collapsed down to a single space. So for example:

  "New<cr><space><space><cr><space><space>York"

would become:

  "New York"
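
Both cases can be checked with a small stylesheet such as this sketch, where the strings are written as literals with &#xD; standing for the <cr> in the examples above:

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Trailing whitespace is stripped, so the comparison succeeds -->
    <xsl:value-of
         select="normalize-space('Alabama&#xD;  &#xD;  ') = 'Alabama'"/>
    <xsl:text>&#xA;</xsl:text>

    <!-- Whitespace in the middle collapses to one space -->
    <xsl:value-of select="normalize-space('New&#xD;  &#xD;  York')"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

This should print true on the first line and New York on the second. In a real stylesheet the same comparison would typically be written as a predicate, for example state[normalize-space(.) = 'Alabama'].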