Regex and text handling

1. What's a regex :-)
2. Pretty printing comments with long line length
3. Searching for url in text
4. Reverse a string (xslt 2)
5. Counting characters
6. Acronym handling
7. Converting a text File to XML
8. Conditional selection of an attribute value
9. {{ braces in regular expressions
10. Remove non numeric content
11. Value and units
12. XPath 2.0 Regex misunderstanding
13. Regex

1.

What's a regex :-)

DaveP

Since it's not so long ago that I barely knew what it was, I've done an idiot's guide to regex in XSLT 2.0. Usual format, linked from here, written as a stylesheet creating the HTML you see. Similarly misspelled too.

2.

Pretty printing comments with long line length

Jeni Tennison



>>> In particular, my XML has lots of very long comment lines, which I
>>> would like to be wrapped to a sensible line length, and indented
>>> with a hanging indent like this:
>>>
>>>   <!-- blah blah
>>>     blah blah
>>>     blah blah -->

For what it's worth, XSLT 2.0 makes this a whole lot easier with its regular expression support. For example, you could use:

<xsl:template match="comment()">
  <xsl:text>&lt;!-- </xsl:text>
  <xsl:analyze-string select="normalize-space(.)"
                      regex=".{{1,69}}(\s|$)">
    <xsl:matching-substring>
      <xsl:if test="position() != 1"><xsl:text>     </xsl:text></xsl:if>
      <xsl:value-of select="." />
      <xsl:text>&#xA;</xsl:text>
    </xsl:matching-substring>
  </xsl:analyze-string>
  <xsl:text>     --></xsl:text>
</xsl:template>

[Note the doubling of the {}s in the regex attribute because it's an attribute value template; that's caught me out twice today!]

You could use something more subtle than normalize-space() to preserve *some* of the spacing within the comment but not others.

3.

Searching for url in text

Jeni Tennison



>Is it possible to write some XSL code which will search some text()
>for a URL pattern and replace it with a hyperlinked version of
>itself.

XSLT 2.0 makes this process a lot easier: you can use <xsl:analyze-string> with regular expressions to do this kind of string processing. In XSLT 2.0, the template would look more like:

 <xsl:template name="hyperlink">
   <xsl:param name="string" select="string(.)" />
   <xsl:analyze-string select="$string"
                       regex="http://[^ ]+">
     <xsl:matching-substring>
       <a href="{.}">
         <xsl:value-of select="." />
       </a>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
       <xsl:value-of select="." />
     </xsl:non-matching-substring>
   </xsl:analyze-string>
 </xsl:template>

and of course the regex for matching URLs could be a lot more sophisticated.
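
Ednote. As a sketch only, here is one way the pattern might be widened to cover https and ftp, and to keep trailing punctuation out of the link. The template name hyperlink-sketch and the regex itself are illustrative assumptions, not a full URL grammar:

 <xsl:template name="hyperlink-sketch">
   <xsl:param name="string" select="string(.)" />
   <!-- Assumed pattern: a scheme, then a run of characters that are not
        whitespace or angle brackets, ending on something other than
        common sentence punctuation -->
   <xsl:analyze-string select="$string"
       regex="(https?|ftp)://[^\s&lt;&gt;]*[^\s&lt;&gt;.,;:!?)]">
     <xsl:matching-substring>
       <a href="{.}">
         <xsl:value-of select="." />
       </a>
     </xsl:matching-substring>
     <xsl:non-matching-substring>
       <xsl:value-of select="." />
     </xsl:non-matching-substring>
   </xsl:analyze-string>
 </xsl:template>

With this pattern, a sentence ending in "see http://example.com." links only "http://example.com" and leaves the full stop as ordinary text. There are no braces in the regex, so nothing needs doubling for the attribute value template.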

4.

Reverse a string (xslt 2)

Mike Kay


>   I checked http://www.w3.org/TR/xpath-functions/ and, in the 
> list of functions on strings, i saw a number of string 
> functions but not one to simply reverse a string, which I 
> would have thought 
> would be fairly obvious.

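It's easily done in XPath 2.0 as

   string-join(
      for $i in reverse(1 to string-length($in)) return substring($in, $i, 1),
      "")

(the range has to be written as reverse(1 to N), because a descending range such as 5 to 1 is an empty sequence), or using a recursive function if you find that more elegant.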
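
Ednote. A minimal sketch of the recursive alternative. The function name my:reverse is hypothetical, and the sketch assumes the stylesheet declares a namespace for the my prefix and binds xs to http://www.w3.org/2001/XMLSchema:

   <xsl:function name="my:reverse" as="xs:string">
     <xsl:param name="in" as="xs:string" />
     <!-- Reverse the tail, then append the first character -->
     <xsl:sequence select="
         if (string-length($in) le 1) then $in
         else concat(my:reverse(substring($in, 2)), substring($in, 1, 1))" />
   </xsl:function>

so that my:reverse('abc') gives 'cba'. A non-recursive one-liner using only standard XPath 2.0 functions does the same job:

   codepoints-to-string(reverse(string-to-codepoints($in)))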

5.

Counting characters

Michael Kay


> What I want to do is to count the number of characters in a  
> text node and all previous text nodes children of the current 
> text node's parent.

Well the XPath 2.0 solution is

sum(for $i in preceding-sibling::text() return string-length($i))

For XSLT 1.0 it's much more difficult: it's the classic problem of summing a calculated value over a node-set. There are several workable solutions:

- Construct a result tree fragment containing the computed values, then use the sum() and xx:node-set() functions to do the summation.
- Use a recursive template
- Use Dimitre's FXSL library.
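
Ednote. The recursive template option might look something like the sketch below, in XSLT 1.0. The template name sum-lengths and its parameter names are made up for illustration:

  <xsl:template name="sum-lengths">
    <xsl:param name="nodes" select="/.." />  <!-- default: empty node-set -->
    <xsl:param name="total" select="0" />
    <xsl:choose>
      <xsl:when test="$nodes">
        <!-- Add the length of the first node, recurse on the rest -->
        <xsl:call-template name="sum-lengths">
          <xsl:with-param name="nodes" select="$nodes[position() &gt; 1]" />
          <xsl:with-param name="total"
                          select="$total + string-length($nodes[1])" />
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="$total" />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

It would be called from a template matching text(), for example:

  <xsl:call-template name="sum-lengths">
    <xsl:with-param name="nodes" select=". | preceding-sibling::text()" />
  </xsl:call-template>

Including "." in the node-set counts the current text node as well as its preceding siblings, which is what the question asks for; drop it to count only the preceding text nodes.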

6.

Acronym handling

Jeni Tennison



I'll just mention that in XSLT 2.0, you can use <xsl:analyze-string> to do this. Something along the lines of:

  <xsl:variable name="acronyms" as="element(acronym)+"
    select="document('../xml/acronyms.xml')/acronyms/acronym" />

  <xsl:variable name="acronym-regex" as="xs:string"
    select="string-join($acronyms/@acronym, '|')" />

  <xsl:analyze-string select="$text" regex="{$acronym-regex}">
    <xsl:matching-substring>
      <xsl:variable name="acronym" as="xs:string" select="." />
      <acronym title="{$acronyms[@acronym = $acronym]}">
        <xsl:value-of select="$acronym" />
      </acronym>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>

7.

Converting a text File to XML

Michael Kay

This kind of thing is very much easier using XSLT 2.0

* use the unparsed-text() function to read the text file

* split it into individual lines using the tokenize() function

* Analyse/parse each line using xsl:analyze-string with a regex

* arrange it into a hierarchical structure using xsl:for-each-group

As you see, each stage benefits from new features in XSLT 2.0. All of them can be done using 1.0 (for example, the first step can be replaced by passing the text as a string-valued parameter), but it's much harder work: sufficiently hard work that it's probably easier to use a different language, such as Perl.

input file

H-A-HEADER some content
I-AN-ITEM-1 more content
I-AN-ITEM-2  and again
S-A-SUMMARY-1 for variety
I-AN-ITEM-3  and change
S-A-SUMMARY-2 and different again

DaveP generated this Stylesheet to show the example.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml" indent="yes" encoding="utf-8"/>

  <xsl:template match="/">
    <xsl:variable name="f"
                  select="unparsed-text('unparsedEntity.txt','utf-8')"/>

    <someRoot>
      <xsl:for-each select='tokenize($f, "\n")'>
        <record>
          <xsl:analyze-string regex="[\-a-zA-Z0-9]+" select=".">
            <xsl:matching-substring>
              <word><xsl:value-of select="."/></word>
            </xsl:matching-substring>
            <xsl:non-matching-substring>
              <other>
                <xsl:value-of select="."/>
              </other>
            </xsl:non-matching-substring>
          </xsl:analyze-string>
        </record>
      </xsl:for-each>
    </someRoot>
  </xsl:template>
</xsl:stylesheet>

8.

Conditional selection of an attribute value

Michael Kay



> choose a link for the title, base on the following conditions:
>   1. if the value of the link node has 'http://' string
>   2. if there's no 'http://' string get the value of the link node 
> that contains 'ftp://' string

> output should be: <a href="selected link">title</a>
> <?xml version="1.0" encoding="ISO-8859-1" standalone="yes"?>
> <record>
> <data>
>   <link>http://www.link1.com</link>
>   <link>3csbv</link>
>   <link>ftp://link2.com</link>
>   <link>http://www.link3.com</link>
>   <title>title</title>
>  </data>
>  <data>
>   <link>45csgh</link>
>   <link>invalid link</link>
>   <link>ftp://link1.com</link>
>   <title>title</title>
>  </data>
> </record>
>

<xsl:template match="data">
<a href="{(link[matches(.,'^http://')],link[matches(.,'^ftp://')])[1]}">
<xsl:value-of select="title"/>
</a>
</xsl:template>

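Ednote. Note the parenthesised sequence inside the braces of the AVT, from which [1] selects the first item (an http:// link if there is one, otherwise an ftp:// link); and the use of the matches function in the predicate.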

9.

{{ braces in regular expressions

Michael Kay

I quote Michael Kay's reply here verbatim. Unless you have run into this problem, you may not appreciate it. Bottom line: be aware that quantifiers such as {3}, which says that the preceding pattern must match exactly three times, use the brace character, which has another meaning in XSLT (it delimits an expression in an attribute value template). It may cause you problems, as it did me.

Ignoring the AVT rules, the regex syntax allows { to be used without escaping inside [], but not outside: outside [] it is reserved for use in regex quantifiers such as x{3}. So it must be escaped as \{. So the regular expression you want is

(\w|\{[^{}]*\})+

which is written in the regex attribute as

regex="(\w|\{{[^{{}}]*\}})+"

It might be less painful to do:

<xsl:variable name="regex">(\w|\{[^{}]*\})+</xsl:variable>
<xsl:analyze-string regex="{$regex}">

though sadly, I suspect that will prevent Saxon precompiling the regular expression :-(

A repetition count in a regex is indicated by curly braces, not square brackets. Remember also that in an attribute value template, curly braces must be doubled. So you want:

regex='(\d{{4}})(\d{{2}})(\d{{2}})'

Alternatively, you could just as well use

regex='(....)(..)(..)'
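
Ednote. To make the doubling rule concrete, here is a small sketch; the input string '20061101' and the date element are made up for illustration. The regex attribute of xsl:analyze-string is an attribute value template, so the braces are doubled:

<xsl:analyze-string select="'20061101'"
                    regex="^(\d{{4}})(\d{{2}})(\d{{2}})$">
  <xsl:matching-substring>
    <date>
      <xsl:value-of select="concat(regex-group(1), '-',
                                   regex-group(2), '-',
                                   regex-group(3))" />
    </date>
  </xsl:matching-substring>
</xsl:analyze-string>

The select attribute of xsl:value-of is a plain XPath expression, not an attribute value template, so there the same regex is written with single braces:

<xsl:value-of select="replace('20061101',
                              '^(\d{4})(\d{2})(\d{2})$', '$1-$2-$3')" />

Both forms produce 2006-11-01.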

10.

Remove non numeric content

Mike Kay


> I need to strip out non-numerical values from a string.
> Here is a sample input value:  TUV0062
> And what I want is :  0062  (or just 62)
> What is the correct way to do this using XSLT 2?

neff.xsl

<?xml version="1.0" encoding="iso-8859-1"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output method="text"/>

<xsl:template match="/">
  <!-- replace() is an XPath 2.0 function, hence version="2.0" above -->
  <xsl:variable name="in" select="'TUV0062'"/>
  <xsl:value-of select="replace($in,'[^0-9]','')"/>
</xsl:template>

</xsl:stylesheet>

provides output of

0062

(wrap the call in number() if you want 62, without the leading zeros).

11.

Value and units

Mike Kay



> I have a variable with a string value like "3.48in" or "1pt" 
> or "4" or 
> "#123456" etc. Of the values that contain units, the first 2 of this 
> particular list, I want to separate out the value from the 
> units. I can 
> include a list of possible unit values, say ("in", "cm", "pt", "em", 
> "px") or whatever.

<xsl:analyze-string select="$in" regex="^(\d*(\.\d*)?)(in|cm|pt|em|px)$">
 <xsl:matching-substring>
  <measure><xsl:value-of select="regex-group(1)"/></measure>
  <units><xsl:value-of select="regex-group(3)"/></units>
 </xsl:matching-substring>
 <xsl:non-matching-substring>
  <value><xsl:value-of select="."/></value>
 </xsl:non-matching-substring>
</xsl:analyze-string>

DaveP. Note that the first group (in brackets) matches the value, the second matches the decimal part of the value, and the third matching group holds the units which are known about.

12.

XPath 2.0 Regex misunderstanding

Abel Braaksma


> I have a date element:
>
> example
>
> <DATE>11/01/2006</DATE>
>
> I'm trying to write an XPath 2.0 Regex to winnow some of the more obvious 
> date format errors. I have tried for about a half-hour, and I admit to being stumped.

I have some trouble understanding what your "passing" and "failing" is about. However, if you are trying to remove the "more obvious date format errors", I believe your "matches(...)" needs to become a "not(matches(...))", since your regular expression is about inclusion, not exclusion.

That said, you can try the following (assuming American dates: MM/DD/YYYY) for matching any date, disallowing years > 2006 and allowing the format 1/2/2006:

<xsl:variable name="dates">
   <DATE>07/18/2006</DATE>
   <DATE>07/12/2006</DATE>
   <DATE>09/25/2006</DATE>
   <DATE>10/24/2006</DATE>
   <DATE>10/18/2006</DATE>
   <DATE>10/10/2006</DATE>
   <DATE>1/2/2006</DATE>

   <!-- false dates -->
   <DATE>22/12/2006</DATE>
   <DATE>00/10/2000</DATE>
   <DATE>01/32/2006</DATE>
   <DATE>10/10/2007</DATE>
   <DATE>12/12/20006</DATE>
</xsl:variable>

<xsl:variable name="date-regex">^(
   0?[1-9]|     <!-- 01-09 and 1-9 -->
   1[0-2]       <!-- 10, 11, 12 -->
   )/(
   0?[1-9]|     <!-- 01-09 and 1-9-->
   [1-2]\d|     <!-- 10-29 -->
   3[01]        <!-- 30, 31 -->
   )/(
   1\d{3}|      <!-- 1000-1999 -->
   200[0-6]     <!-- 2000-2006 -->
   )$
</xsl:variable>

<xsl:for-each select="$dates/DATE">
   <xsl:value-of select="concat(., ': ')" />

   <!-- add normalize-space, because of a bug
       in saxon prior to 8.0.0.4 with leading space -->
   <xsl:value-of select="matches(.,
       normalize-space($date-regex), 'x')" />

   <xsl:text>&#xA;</xsl:text>
</xsl:for-each>

This outputs:

07/18/2006: true
07/12/2006: true
09/25/2006: true
10/24/2006: true
10/18/2006: true
10/10/2006: true
1/2/2006: true
22/12/2006: false
00/10/2000: false
01/32/2006: false
10/10/2007: false
12/12/20006: false

>
> Here is the relevant part of the template:
>
> <xsl:when test="matches(DATE,'[0-1][0-2]/[0-3][0-9]/2006')"><bad-date /></xsl:when>

What your statement implies is: output "bad-date" node when:

1) a date month is in the range (00, 01, 02, 10, 11, 12)

2) a date day is in the range (00, 01, ... 09, 10, 11, ... 19, 20, 21, ... 29, 30, 31, ... 39)

3) the year is 2006.

Well, I don't know much of your calendar system, but I can hardly believe you consider a date such as "00/39/2006" to be correct, so here's a part of your problem. I know from my own experience that regexing numeric values is a tricky business (that is: think strings, not numbers).

For an article I wanted to write for a long time, but still haven't, I created a template that helps in regexing numeric values. It will simply output the right regexes for you, if you give it a number:

my:regex-from-number('376', 0) will give:

[0-2]\d{2}|
3[0-6]\d|
37[0-5]|
376|\d{2}

It requires some getting used to, but I recall that Jeffrey Friedl named this: enrolling the number, or something similar. For small numbers you can easily do it by hand, but it is still hard for many mere mortals. It is optimized for repeated digits (like 2006). The output regex works perfectly. A few notes (if you plan to use it):

|\d{2}

Leave out this part if you require a fixed number of digits. I.e.: 034 and 009. By default, 34, 9 etc are allowed.

376

The input number. Repeating the number is not necessary for making a bullet proof regular expression, but it made me feel good. The larger the maximum number you need to match, the easier it gets putting it there: you see instantly what number is being matched.

The rest speaks for itself, I believe. But call in anytime if you want some additional help. The expressions in the opening are taken from this template to ensure I did the right thing, however, I made them a bit more readable.

<xsl:function name="my:regex-from-number">
   <xsl:param name="number" />
   <xsl:param name="pos" />
   <xsl:variable name="digit1" select="substring($number, $pos, 1)" />
   <xsl:variable name="digit2" select="substring($number, $pos + 1, 1)" />
   <xsl:variable name="len" select="string-length($number)" />

   <xsl:value-of select="
       if($len = $pos)
           then concat
               (
                   $digit1,
                   '|\d',

                   if($pos - 1 le 1) then ''
                   else concat('{', $pos - 1, '}')
               )

       else
           if ($digit2 = '0')
           then concat
               (
                   $digit1,
                   my:regex-from-number($number, $pos + 1)
               )

           else concat
               (
                   $digit1,

                   if(xs:integer($digit2) - 1 = 0) then '0'
                   else concat('[0-', xs:integer($digit2) - 1, ']'),

                   if($pos + 1 = $len) then '|'
                   else
                       if($len - $pos - 1 = 1) then '\d|'
                       else concat('\d{', $len - $pos - 1, '}|'),

                   '&#xA;', substring($number, 1, $pos),

                   my:regex-from-number($number, $pos + 1)
               )" />
</xsl:function>

13.

Regex

Abel Braaksma

A good thing to know about regexes is that, besides being powerful, they can be very dangerous too, especially to the unaware, when backtracking causes the regex to take exponential time on non-matching strings. An example of such a regex is in this post: nabble.com

If you are going to use regexes in a production environment make sure to test them thoroughly for this behavior or your processor may hang occasionally.
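
Ednote. As a rough illustration (the pattern and the test string are invented here, they are not the ones from the post mentioned above), a nested quantifier is the usual culprit. Both tests below return false, but in a backtracking engine the first may try a very large number of ways of splitting the run of a's before giving up, while the second fails quickly. How bad it gets depends on the regex engine your processor uses:

  <!-- Nested quantifier: (a+)+ can divide the a's in exponentially many ways -->
  <xsl:value-of select="matches('aaaaaaaaaaaaaaaaaaaaaaaaaaaac', '^(a+)+b$')" />

  <!-- Same language, no nesting -->
  <xsl:value-of select="matches('aaaaaaaaaaaaaaaaaaaaaaaaaaaac', '^a+b$')" />

Timing a regex against a long input that almost matches is a cheap way to look for this behavior.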

> Is there any way I can split this RegEx on separate lines and/or add
> whitespace so that it would be more readable?

You already heard of the 'x' modifier, but there are a few things that you should know before splitting your regex into a more readable format:

If you use Saxon, several bugs concerning whitespace handling have been fixed in the 8.8 and 8.9 releases, some of which you may consider significant, like this one, which is now fixed: nabble.com

The "ignore whitespace" is very literally so. I.e., in XSLT regexes, this: fn:matches("hello world", "hello\ sworld", "x") returns true. The "\ s" part in the regex is, with whitespace removed, "\s" and matches a space. Most regex engines (Perl for one) consider an escaped space as a space.

The only place where you must be aware of whitespace with 'x' on is inside character classes, where it is not ignored: [abc ] matches 'a', 'b', 'c' or ' '.

You probably don't want to do this, but this is allowed with the 'x' modifier: "\p{ I s B a s i c L a t i n }+" and is the same as "\p{IsBasicLatin}+".

And a tip for making your regexes more readable: introduce comments inside your regexes. In other regex languages you can do that inside the regex language, but not with a regex in XSLT. You can easily fix this by putting your regexes inside a variable and always calling them with the 'x' modifier:

<xsl:variable name="myregex" as="xs:string">
    (          <!-- grab everything -->
    "          <!-- start of a q. string -->
    [^"]*      <!-- zero or more non-quotes -->
    "          <!-- end of a q. string -->
    )          <!-- closing 'grab all' -->
</xsl:variable>

I use this method to some extent in a format that allows recursive and repetitive regexes on input by just supplying a 'parser' written in XSLT with a set of regexes placed in XML that are then applied to the input. If you have many regexes, you will find that it is easier to maintain them by collecting them in a library and reusing them.
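
Ednote. A short sketch of how the commented variable above might be used; $text is a hypothetical variable holding the input string, and the quoted element is made up for illustration. With matches() the 'x' flag goes in the third argument:

<xsl:value-of select="matches('say &quot;hello&quot; now', $myregex, 'x')" />

which returns true. With xsl:analyze-string the flag goes in the flags attribute:

<xsl:analyze-string select="$text" regex="{$myregex}" flags="x">
  <xsl:matching-substring>
    <!-- group 1 is the quoted string, including its quotes -->
    <quoted><xsl:value-of select="regex-group(1)" /></quoted>
  </xsl:matching-substring>
</xsl:analyze-string>

Because the regex arrives through a variable reference, any braces in its value are not treated as attribute value template delimiters, so nothing inside the variable needs doubling.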