Text processing

1. Text replace
2. Spell check?
3. Breaking string into substrings
4. Unwanted spaces in built strings
5. value-of with separator
6. Entities
7. Conditional display of parameters
8. Restrict a string to n words
9. First n words of a text node
10. String match, multiple targets
11. Quotes in XSLT 2.0

1.

Text replace

Michael Kay

To display a list of software releases separated by ", " when the XML source separates them with ";":

<procedure software="3.3.7;3.3.8;3.3.9;3.4.0;3.4.2;3.4.4">

It's simpler than that: in 2.0 you can do

<xsl:value-of select="replace(@software, ';', ', ')"/>
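A minimal sketch of that call inside a complete stylesheet, assuming the procedure element above is the input document. The second argument of replace() is a regular expression, so a delimiter such as "." or "|" would need escaping; ";" does not. The tokenize() variant in the comment is an alternative, not part of the original answer.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="procedure">
    <!-- Swap each ";" for ", " in the software attribute -->
    <xsl:value-of select="replace(@software, ';', ', ')"/>
    <!-- Alternative sketch: split on ";" and let value-of rejoin the pieces
         <xsl:value-of select="tokenize(@software, ';')" separator=", "/> -->
  </xsl:template>
</xsl:stylesheet>
```

For the sample element this should give 3.3.7, 3.3.8, 3.3.9, 3.4.0, 3.4.2, 3.4.4.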

2.

Spell check?

Dimitre Novatchev


> As far as the XSLT 2.0 working draft goes in 
> regards to bringing Perl  type text processing to the XML 
> developer it is still up to the developer to fine-tune these 
> capabilities to cover their specific needs.  For example, a spell 
> checker.

> Can anyone who may have extended experience in regards to the 
> development of such capabilities using XSLT share with the rest of us 
> your experience?

Recently I had fun with an f:binSearch() function and then, logically, with f:spell().

I have a dictionary of about 47000 English wordforms, which I search with f:binSearch().

I had to produce a faster function than the current quadratic str-split-to-words template; this is the f:getWords() function.

All these functions can be downloaded from the FXSL CVS at SourceForge.

The combination of these functions works quite well.

This transformation (test-FuncSpell.xsl):

 
<xsl:stylesheet version="2.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema"
 xmlns:f="http://fxsl.sf.net/"
 exclude-result-prefixes="f xs"
 >
  <xsl:import href="../f/func-getWords.xsl"/>
  <xsl:import href="../f/func-spell.xsl"/>
  
  <xsl:output omit-xml-declaration="yes"/>
 
 <xsl:variable name="vDelim" 
    as="xs:string"> ,—:.-&#9;&#10;&#13;!?;</xsl:variable>
<!-- Space, Comma, mdash, Colon, Dot, Dash, Tab, NL, CR, Exclamation mark,
Question mark, Semicolon -->

 <!-- To be applied on ../data/othello.xml -->
  <xsl:template match="/">
    <xsl:variable name="vwordNodes" as="element()*">
       <xsl:for-each select="//text()/lower-case(.)">
         <xsl:sequence select="f:getWords(., $vDelim, 1)"/>
       </xsl:for-each>
    </xsl:variable>
    
    <xsl:variable name="vUnique" as="xs:string+">
      <xsl:perform-sort select="distinct-values($vwordNodes)">
        <xsl:sort select="."/>
      </xsl:perform-sort>
    </xsl:variable>
    
    <xsl:variable name="vnotFound" as="xs:string*"
     select="$vUnique[not(f:spell(.))]"/>

    <xsl:value-of separator="&#xA;"
     select="$vnotFound"/>
  
    A total of <xsl:value-of select="count($vwordNodes)"/> words 
    were spelt, (<xsl:value-of select="count($vUnique)"/>) distinct.
    
    <xsl:value-of select="count($vnotFound)"/> not found.
 </xsl:template>
</xsl:stylesheet>

When applied to othello.xml (around 29000 words), it produces this result:


<a-list-of-567-unknown-words-omitted/>

  
    A total of 28622 words 
    were spelt, (3669) distinct.
    
    567 not found.

So checking 3669 distinct words in 7015 milliseconds works out to 523.02 words/sec.

The actual speed is faster, as the total time includes splitting up the words and finding the distinct words.

Among the unknown words are such nice words as:

affordeth
affrighted
ariseth
arithmetician
arrivance
bethink
betimes
bewhored

3.

Breaking string into substrings

David Carlisle



> In my project I am dealing with at least 10 delimiters?

If you don't need to do different things for different delimiters, then:

The XPath 2.0 tokenize() function allows the delimiter to be specified by a regular expression, so in that case you can just specify whatever you want, e.g. [^a-zA-Z]+ to treat any run of non-(ASCII)-letters as a delimiter.
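A small sketch of tokenize() with the negated character class [^a-zA-Z]+ as the delimiter pattern. The sample string is made up for illustration. Note that if the input begins or ends with a delimiter, tokenize() returns a zero-length string at that end.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Comma+space, semicolon and a run of dashes all act as one delimiter each -->
    <xsl:value-of
        select="tokenize('alpha, beta;gamma--delta', '[^a-zA-Z]+')"
        separator="|"/>
  </xsl:template>
</xsl:stylesheet>
```

Run against any input document, this should output alpha|beta|gamma|delta.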

4.

Unwanted spaces in built strings

Michael Kay

When building a string in a variable while avoiding document node creation, isn't it preferable to use item()+ rather than xs:string+? That allows adjacent text nodes to merge before atomization, producing a sequence of one item and so bypassing the separator issue.

Mike Kay responds, logical as ever, with

No, if you want to build a string in a variable then you should define the type of the variable as xs:string.

You can either do:

 
  <xsl:variable name="foo" as="xs:string">
   <xsl:value-of>
    <xsl:text>abc</xsl:text><xsl:value-of select="'def'"/>
   </xsl:value-of>
  </xsl:variable>

or

  <xsl:variable name="foo" as="xs:string">
   <xsl:sequence select="concat('abc', 'def')"/>
  </xsl:variable>

or of course

  <xsl:variable name="foo" select="concat('abc', 'def')"/>

Personally I prefer to work entirely with strings. If you don't need text nodes, don't create them.

David Carlisle offers the same advice, with

If you want to build a (sequence of) string(s) in a variable then you have to use xs:string+, as otherwise you will get other things in your sequence (like nodes). Sometimes you don't need to worry about the difference between a string and a text node, but sometimes you do (as you've demonstrated). In general the distinction is far more important in 2.0 than in 1.0.

So, if you want a sequence of strings use xs:string+, and use separator="" on a value-of if you don't want the spaces.

If you want a sequence of nodes and/or strings use item()+ (but personally, if I knew there was a possibility of adding an unwanted space, I'd just always add separator="" rather than run through the six-point simple content construction rules in my head every time to see whether the spaces would be added).

From Andrew's earlier query, Michael Kay justifies the WG decision:

> Indeed, the key here is that text nodes get merged together at stage 2
> before the atomization at stage 3 - which is why I'm still confused if I
> specify xs:string+ instead of item()+:

> <xsl:variable name="foo" as="xs:string+">
>       <xsl:text/>abc<xsl:sequence select="'def'"/>
> </xsl:variable>

> <xsl:variable name="foo2" as="xs:string+">
>       <xsl:text/>abc<xsl:value-of select="'def'"/>
> </xsl:variable>


> <xsl:value-of select="$foo" separator=","/>

> Gives ,abc,def

> <xsl:value-of select="$foo2" separator=","/>

> Gives ,abc,def

> I would have expected the same behaviour as specifying item()+ as
> atomization occurs after the merging of the text nodes in $foo2

> This suggests that by specifying xs:string Saxon is jumping the gun and
> converting the text nodes (zero length text nodes as well - the leading
> comma) to strings before stage 2.

Indeed, it's converting text nodes to strings before it even starts the "constructing simple content" process. The overall process here is:

1. Construct the value of the variable

1a. evaluate the sequence constructor, producing a sequence of text nodes (foo2) or a sequence containing a mixture of text nodes and strings (foo)

1b. apply the "function conversion rules" to convert the result of 1a to the required type (xs:string+). This causes text nodes to be individually atomized to strings

2. xsl:value-of select="$foo" then invokes the "constructing simple content" rules to convert a sequence of text nodes and/or strings to a single text node. In both cases the input is a sequence of strings, so the rules for joining adjacent text nodes don't kick in.

It's certainly true that this whole business is going to generate a lot of questions. We've tried to design the rules so that they are (a) backwards compatible, and (b) do the "right" thing in common cases. The downside of this is that the rules are quite complicated, and when they don't do the obvious thing, it's quite tricky to work out why.
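The two-step process above can be sketched by declaring the same sequence constructor twice, once as xs:string+ and once as item()+. The variable names asStrings and asItems are made up for this illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <!-- Step 1b applies: each text node is atomized to its own string -->
  <xsl:variable name="asStrings" as="xs:string+">
    <xsl:text/>abc<xsl:value-of select="'def'"/>
  </xsl:variable>

  <!-- No conversion: the text nodes stay nodes until value-of sees them -->
  <xsl:variable name="asItems" as="item()+">
    <xsl:text/>abc<xsl:value-of select="'def'"/>
  </xsl:variable>

  <xsl:template match="/">
    <!-- Three strings ("", "abc", "def") joined with the separator -->
    <xsl:value-of select="$asStrings" separator=","/>
    <xsl:text>&#10;</xsl:text>
    <!-- Empty text node dropped, adjacent text nodes merged, then atomized -->
    <xsl:value-of select="$asItems" separator=","/>
  </xsl:template>
</xsl:stylesheet>
```

The first line of output should be ,abc,def as reported in the question. The second should be abcdef, because the merge of adjacent text nodes happens before any separator is inserted.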

5.

value-of with separator

David Carlisle




I have trouble understanding the separator-attribute of value-of.
This is my template:

<xsl:template match="example">
  <helloWorld>
     <xsl:value-of select="element()/text()" separator=", "/>
  </helloWorld>
</xsl:template>

The element "example" contains several child-nodes with text. The
above expression gives the  expected values but without the separator
("TextTextText"). But if I change the value-of expression to this...

<xsl:value-of select="*" separator=", "/>

... I also get the values, now separated with commas ("Text, Text, Text").

Now I wonder why the result of the first expression contains no
separator while the other one does. Any explanations?

If the document is

<x>
<a>one<!-- here -->two</a>
<a>three</a>
<a>four</a>
</x>

and the current node is x then

element()/text() will select four text nodes with values "one" "two" "three" "four" so <xsl:value-of select="element()/text()" separator=", "/> will generate one text node with value "one, two, three, four"

* will select three element nodes, each with name a and with string values "onetwo" "three" "four" so

<xsl:value-of select="*" separator=", "/>

will generate one text node with string value "onetwo, three, four".

>Isn't it that adjacent text nodes are being merged before the
> separator is applied?

No, the separator is added between the string values of the items in the sequence, and the resulting string is then used to generate a single text node. This node may merge with other text nodes generated under the same parent, but that happens after the separators are added.

6.

Entities

Michael Kay, David Carlisle





I have to replace occurrences of something like this "^12" (a custom
placeholder) with a special character, for example a cedilla
(http://en.wikipedia.org/wiki/Cedilla). Therefore i use Michael's
"identity template" in combination with this replace function:
<xsl:value-of select="replace(., '\^12', '&ccedil;')"/>

Many questions arise around this approach (at least for me):
1.) If i use the above i get "the entity is not declared", but when i
use "&amp;" instead of the cedilla everything works fine. How do i
know which entities are declared by default and which not? How can i
declare an entity of my own?
2.) What is the difference of the usage of - for example - "&amp;" and
"&#38;"? When do i have to use the one and not the other?
3.) And where can i find a good overview of entities?

David answers

Entities are expanded by the parser _before_ XSLT starts, so XSLT sees the same input whether you use the entity reference, or just use the character directly, or if you use a numeric character reference (which doesn't need to be declared).

So if your keyboard or editor allows you to type a c-cedilla character directly, you can just do that (if your editor uses iso-8859-1 you'd need to say your XSL file was in that encoding by putting <?xml version="1.0" encoding="iso-8859-1"?> at the top), or you can use the numeric character reference &#xE7;.

> Do i have to nest replace-functions for each of them in one
> another like ...
> <xsl:value-of select="replace(replace(., '\^12', '&ccedil;'), '\^13',
> '&amp;')"/>
> ... or is there a more elegant solution for this?

In the most general case, of course, nesting replace() calls is always an option, but here I suspect that you can do something like

<xsl:analyze-string select="." regex="\^([0-9]+)">
<xsl:matching-substring>
<xsl:value-of select="$replacements[number(regex-group(1))]"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:value-of select="."/>
</xsl:non-matching-substring>
</xsl:analyze-string>

together with something that makes $replacements a sequence with a c-cedilla in position 12, e.g.

<xsl:variable name="replacements"  select="(
'a','b',...,'&#xE7;', '&amp;')
"/>

MK adds


> 2.) What is the difference of the usage of - for example -
> "&amp;" and "&#38;"? When do i have to use the one and not the other?

The first is technically an "entity reference", the second is a "character reference". The only difference is that entities have to be declared in the DTD, except for the five built-in ones. Numeric character references can be used without needing a declaration.


> 3.) And where can i find a good overview of entities?

I use Bob DuCharme's "Annotated XML Specification" - very useful because it gives the actual text of the specification, then Bob's explanation of what it really means.


>
> Finally i have also a non-entity question:
> I have to replace many of equivalent placeholders in the same
> text too. Do i have to nest replace-functions for each of
> them in one another like ...
> <xsl:value-of select="replace(replace(., '\^12', '&ccedil;'),
> '\^13', '&amp;')"/> ... or is there a more elegant solution for this?
>

As well as DC's solution, another approach is to have a table of replacements:

<xsl:variable name="mods" as="element(mod)*">
 <mod from="\^12" to="&#xE7;"/>
 <mod from="\^13" to="&amp;"/>
</xsl:variable>

and run through them with a recursive function:

<xsl:function name="f:multi-replace" as="xs:string">
 <xsl:param name="in" as="xs:string"/>
 <xsl:param name="mods" as="element(mod)*"/>
 <xsl:choose>
   <xsl:when test="$mods">
     <xsl:sequence select="f:multi-replace(
                             replace($in, $mods[1]/@from, $mods[1]/@to),
                             subsequence($mods, 2))"/>
   </xsl:when>
   <xsl:otherwise>
     <xsl:sequence select="$in"/>
   </xsl:otherwise>
 </xsl:choose>
</xsl:function>
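A sketch of how the table and the recursive function might be wired into an identity transform. The namespace URI bound to f: is a placeholder chosen for this example, and the c-cedilla is written as the numeric reference &#xE7; so that no DTD declaration is needed.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f">

  <!-- Replacement table: regex in @from, replacement text in @to -->
  <xsl:variable name="mods" as="element(mod)*">
    <mod from="\^12" to="&#xE7;"/>
    <mod from="\^13" to="&amp;"/>
  </xsl:variable>

  <!-- Apply the first replacement, then recurse on the rest of the table -->
  <xsl:function name="f:multi-replace" as="xs:string">
    <xsl:param name="in" as="xs:string"/>
    <xsl:param name="mods" as="element(mod)*"/>
    <xsl:choose>
      <xsl:when test="$mods">
        <xsl:sequence select="f:multi-replace(
                                replace($in, $mods[1]/@from, $mods[1]/@to),
                                subsequence($mods, 2))"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:sequence select="$in"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Identity template: copy everything unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes are passed through the replacement table -->
  <xsl:template match="text()">
    <xsl:value-of select="f:multi-replace(., $mods)"/>
  </xsl:template>
</xsl:stylesheet>
```

Because @to is used as a replace() replacement string, a literal "$" or "\" in it would need a backslash escape.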

7.

Conditional display of parameters

Michael Kay



> I have three global xsl:param's which I need to display if
> any of them are selected:
>
>   <xsl:param name="showReplicates" select="1"/>
>   <xsl:param name="showMetadata" select="1"/>
>   <xsl:param name="showElectronicSignature" select="1"/>
>
> if all the flags are turned on, the output would be:
>
> (View - Reps, Meta-Data, Signatures)

<xsl:value-of select="string-join(
   ('Reps'[$showReplicates=1], 'Meta-Data'[$showMetadata=1],
'Signatures'[$showElectronicSignature=1]),
    ', ')"/>
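A sketch that also produces the surrounding "(View - ...)" text and omits it entirely when no flag is set. The variable name views is an assumption made for this example. Each string literal carries a boolean predicate, so it survives in the sequence only when its flag equals 1.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="text"/>

  <xsl:param name="showReplicates" select="1"/>
  <xsl:param name="showMetadata" select="1"/>
  <xsl:param name="showElectronicSignature" select="1"/>

  <xsl:template match="/">
    <!-- Keep only the labels whose flag is switched on -->
    <xsl:variable name="views" as="xs:string*"
        select="('Reps'[$showReplicates = 1],
                 'Meta-Data'[$showMetadata = 1],
                 'Signatures'[$showElectronicSignature = 1])"/>
    <!-- Print nothing at all when every flag is off -->
    <xsl:if test="exists($views)">
      <xsl:value-of
          select="concat('(View - ', string-join($views, ', '), ')')"/>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>
```

With the default parameter values this should print (View - Reps, Meta-Data, Signatures). If the parameters are supplied from outside as strings, the comparison with the integer 1 may need an explicit conversion such as number($showReplicates) = 1.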

8.

Restrict a string to n words

Michael Kay



>   What I need is a function that will limit a string to a
> certain number of words.

In XSLT 2.0, that's

tokenize($in, '\W')[position() = 1 to $n]

where $in is your input string and $n is the number of words.

It's a fair bit harder in XSLT 1.0 (most things are).
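A sketch of the same idea wrapped in a function that returns the truncated string rather than a sequence of tokens. It is a variant, not the expression above: it splits on whitespace after normalize-space(), so punctuation stays attached to its word and runs of spaces do not produce empty tokens. The function name f:first-words and its namespace URI are made up for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.com/local-functions"
    exclude-result-prefixes="xs f">
  <xsl:output method="text"/>

  <!-- Return the first $n whitespace-separated words of $in -->
  <xsl:function name="f:first-words" as="xs:string">
    <xsl:param name="in" as="xs:string"/>
    <xsl:param name="n" as="xs:integer"/>
    <xsl:sequence
        select="string-join(
                  tokenize(normalize-space($in), ' ')[position() = 1 to $n],
                  ' ')"/>
  </xsl:function>

  <xsl:template match="/">
    <xsl:value-of select="f:first-words('  The quick   brown fox jumps ', 3)"/>
  </xsl:template>
</xsl:stylesheet>
```

This should output The quick brown.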

9.

First n words of a text node

Michael Kay




> I have an XML file like this:
> > 
> > <a> lorem ipsum text ... lorem ipsum <b> dolor </b> lorem ipsum ...
> > lorem ipsum </a>
> > 
> > I want to create a string of the five words from <a> right 
> > before the <b> tag, and a string of the five words 
> > immediately after <b>.  All I can do is create strings from 
> > the beginning or the end of the <a> tag, but basically I want 
> > the text in the middle, relative to the child <b> node.

In 2.0, assuming the <a> element is the context node, the two sequences are given by

subsequence(tokenize(b/following-sibling::text(), '\s'), 1, 5)

and

reverse(subsequence(reverse(tokenize(b/preceding-sibling::text(), 
'\s')), 1,5))
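A sketch of the two expressions in a template, under two assumptions: each a element has a single b child, and only the text nodes immediately next to b matter. It normalizes whitespace first so that leading spaces do not yield empty tokens. The result element names context, before and after are invented for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">
  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="a">
    <!-- Words of the text node immediately before the b child -->
    <xsl:variable name="before" as="xs:string*"
        select="tokenize(normalize-space(b/preceding-sibling::text()[1]), ' ')"/>
    <!-- Words of the text node immediately after the b child -->
    <xsl:variable name="after" as="xs:string*"
        select="tokenize(normalize-space(b/following-sibling::text()[1]), ' ')"/>
    <context>
      <!-- Last five words before b, kept in document order -->
      <before>
        <xsl:value-of select="$before[position() gt last() - 5]"/>
      </before>
      <!-- First five words after b -->
      <after>
        <xsl:value-of select="subsequence($after, 1, 5)"/>
      </after>
    </context>
  </xsl:template>
</xsl:stylesheet>
```

Filtering with position() gt last() - 5 avoids the double reverse() while giving the same words in the same order.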

10.

String match, multiple targets

Andrew Welch



> Basically I want to have something that does this: 
> contains('$d/ris:organ/text()', 'Hamburg' or 'Koblenz' or 'xxx'...) 
>  ===> Compare 1 String with multiple strings.

some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
contains($d/ris:organ/text(), $x)

E.g.

<xsl:if test="some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
contains($d/ris:organ/text(), $x)">

Florent Georges adds

And the OP almost certainly wants the following instead:

    some $x in ('Hamburg', 'Koblenz', 'xxx') satisfies
      contains($d/ris:organ, $x)

11.

Quotes in XSLT 2.0

David Carlisle




>I want to hold a string containing both single and double quotes (apos
>and quot) in a variable.
><xsl:variable name="x" select="'...'"/>
>I enclose the XPath expression in double quotes, hence I'll have to use
>entity references or numerical character references to refer to that
>character from within the expression. Correct.
>I enclose the string in single quotes, hence - I think - I'll have to
>use entity references or numerical character references to refer to that
>character from with the string. And this is wrong.

In XPath 1.0 it is impossible to have a string literal containing both kinds of quote.

That's not much hardship in XSLT 2.0 as, as you say, you can use the <xsl:variable name="x">'"'"'"</xsl:variable> form, but it does cause inconvenience in other contexts.

XPath 2.0 allows you to quote the character used to delimit the string by doubling it, so you can have a string literal such as "'""" which is the two-character string '".

Mike Kay expands on this...

And of course you can escape the attribute delimiter using an XML entity reference.

<xsl:if test="$x = 'He said, &quot;I can''t&quot;'">

In 1.0 you can't have a string literal containing both single and double quotes. Use concat:

<xsl:variable name="quot">"</xsl:variable>
<xsl:variable name="apos">'</xsl:variable>
<xsl:if test="$x = concat('He said, ', $quot, 'I can', $apos, 't', $quot)">