xml to plain text using xslt

PlainText output

1. DocBook to plain text - what do you use?
2. Inserting page breaks into plain text output.
3. Layout dependent on position
4. IDENT
5. Line split and empty content
6. Indentation with whitespaces
7. Line split at 64 characters in plain text output
8. text output method, and newline problems.
9. Line breaking stylesheet for plain text
10. Splitting Camel Case strings.
11. Line breaking stylesheet for plain text
12. DocBook to plain text - what do you use?

1.

DocBook to plain text - what do you use?

Wendell Piez

Ednote: Wendell's explanation should apply to most XML formats, not just DocBook.



>I agree that it seems like it should be much easier.  That's one reason 
>I'm puzzled that such a thing doesn't seem to exist.

If you think about the way interpolations of inline pseudo-markup (like *this* for emphasis) and similar constructs will affect line wrapping, for example, particularly since


   Some blocks need to get indented like this: it is several lines
   long, and is required to *wrap* nicely, no matter what might turn
   up in it -- requiring the smart introduction of whitespace both at
   line ends and at line starts (and maybe the extent of the indent
   varies as well) --

then it is apparent that creating "pretty plain text" is not as trivial as it may first appear.

My guess is that the graceful XSLT-only solution will require two or three passes over the data.

Another sad fact of life is that one person's pretty plain text is another's ugly stepsister.


>Is it just that no one is interested in producing plain text?  (For 
>example, to produce README files and such from a distribution's general 
>DocBook documentation sources?)  Or is the need little enough that lynx 
>-dump is good enough for people's purposes?

It seems to be one of those problems that is *nearly* general enough for a generic solution, but that has hidden gotchas and local particularities that have hindered the development of a one-size-fits-all solution.

Here's an article about an approach that uses Java (SAX) for the final stage of production of the plain text: ibm.com. So it's not that this problem hasn't come up before. (Not too long ago the list even discussed producing plain-text tables from XML -- a real beast.)

2.

Inserting page breaks into plain text output.

Chris Bayes

Output a <FF> tag where you want the page break, then post-process the .txt file to change <FF> to the form feed character (x0C). You could do it in 3 lines of Perl or 6 or so lines of JavaScript, and it wouldn't take too long.

Something like this, off the top of my head:

Perl

C:> xalan source.xml transform.xsl sansff.txt
C:> type sansff.txt | perl convert.pl > withff.txt
convert.pl----------------------
#!perl
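# binmode needs a filehandle; treat both streams as binary so no newline translation occurs
binmode(STDIN);
binmode(STDOUT);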
my $slurp = join('', <>);
$slurp =~ s/<FF>/\x0C/gs;
print $slurp;

JavaScript
 
C:> xalan source.xml transform.xsl sansff.txt
C:> cscript convert.js sansff.txt withff.txt
convert.js----------------------
var fso = new ActiveXObject("Scripting.FileSystemObject");
var file = fso.OpenTextFile(WScript.Arguments(0), 1);
var fileStr = file.ReadAll();
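// Replace each <FF> marker with a form feed (or use a string literal containing a pasted FF character)
fileStr = fileStr.replace(/<FF>/g, String.fromCharCode(12));
file.Close();
var outFile = fso.CreateTextFile(WScript.Arguments(1), true);
outFile.Write(fileStr);
outFile.Close();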
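
The same post-processing step can be sketched in Python. This is only an illustration of the marker replacement described above, not part of Chris's answer; the function names and the file arguments are assumptions made for the example.

```python
FORM_FEED = b"\x0c"


def insert_form_feeds(data: bytes, marker: bytes = b"<FF>") -> bytes:
    """Replace every occurrence of the marker with a form feed byte."""
    return data.replace(marker, FORM_FEED)


def convert_file(src_path: str, dst_path: str) -> None:
    """Read the transformed text and write a copy with real form feeds.

    Both files are opened in binary mode so that no newline translation
    takes place (the same reason the Perl version calls binmode).
    """
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(insert_form_feeds(data))
```

Calling convert_file("sansff.txt", "withff.txt") would mirror the two command lines shown above.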

3.

Layout dependent on position

David Carlisle

Based on the pseudo-XML below:

<groups>
    <group>First</group>
    <group>Second</group>
    <group>Third</group>
    <group>Fourth</group>
    <group>Fifth</group>
</groups>

In my XSL I want output that "increments" for each group item as it is processed, giving HTML output of:

[#] First
[#][#] Second
[#][#][#] Third
[#][#][#][#] Fourth
[#][#][#][#][#] Fifth

where "[#]" is some bit of text (not a number) that can be repeated to space the output in a pseudo-tree format.

David offers:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0"
                >

<xsl:output method="xml" indent="yes"/>

<xsl:template match="group">
<xsl:for-each select="preceding-sibling::*|.">[#]</xsl:for-each>
<xsl:value-of select="."/>
</xsl:template>

</xsl:stylesheet>

Ed Staub also offers:

Two references to the same page in ten minutes; see page 546 etc. in Michael Kay's "XSLT" book, on "Programming without Assignment Statements". Either recurse or use something like

substring(">>>>>>>>>>>",1,position())

which only works when the repeated marker is a single character, and only for as many items as the literal string has characters.

In this case, using the > symbol:

<xsl:template match="group">
  <xsl:value-of select="substring('>>>>>>>>>>>', 1, position())"/>
  <xsl:value-of select="."/>
</xsl:template>
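
For comparison, here is a small Python sketch of the two ideas above: repeating a marker once per position, and taking a prefix of a filler string. The function names are made up for this illustration, and the space after the marker is added only to match the layout requested in the question.

```python
def prefix_by_position(items, marker="[#]"):
    """Prefix each item with the marker repeated once per position (1-based)."""
    return [marker * pos + " " + item for pos, item in enumerate(items, start=1)]


def prefix_by_substring(items, filler=">>>>>>>>>>>"):
    """Prefix each item with the first `position` characters of the filler.

    Like substring(filler, 1, position()), this only works for
    single-character markers and runs out once the position exceeds
    the length of the filler string.
    """
    return [filler[:pos] + item for pos, item in enumerate(items, start=1)]


groups = ["First", "Second", "Third", "Fourth", "Fifth"]
tree_lines = prefix_by_position(groups)
arrow_lines = prefix_by_substring(groups)
```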

4.

IDENT

Jeni Tennison

> I'm transforming an HTML file to a text file with XSL, and I want to
> obtain plain text, without line feeds. How can I obtain it?

The short answer: wrap xsl:text elements around the actual text that you want output.

The long answer: have a look at this template:

>   <xsl:template match="B" mode="special">
>     &lt;B&gt;
>     <xsl:value-of select="."/>
>     &lt;/B&gt;<xsl:text> </xsl:text>
>   </xsl:template>

When a DOM builder goes over that bit of XML, it creates a node tree that looks like:

+- (element) xsl:template
   +- (text) [NL][SP][SP]<B>[NL][SP][SP]
   +- (element) xsl:value-of
   +- (text) [NL][SP][SP]</B>
   +- (element) xsl:text
   |  +- (text) [SP]
   +- (text) [NL]

All those new lines and spaces are because of the indenting in the stylesheet. When the XSLT processor gets hold of this node tree, it first strips out all whitespace-only text nodes that are in the tree, unless they appear within an xsl:text element. So it gets rid of the last text node to give:

+- (element) xsl:template
   +- (text) [NL][SP][SP]<B>[NL][SP][SP]
   +- (element) xsl:value-of
   +- (text) [NL][SP][SP]</B>
   +- (element) xsl:text
      +- (text) [SP]

Now, when the XSLT processor applies this template, any plain text nodes that are found are copied directly to the result tree. The xsl:value-of adds the relevant value (e.g. 'x') and the xsl:text adds whatever its content is. You end up with a result tree that looks like:

+- (text) [NL][SP][SP]<B>[NL][SP][SP]x[NL][SP][SP]</B>[SP]

which is why you get all the whitespace that you do in your output - it's copying over the whitespace that you've used to indent stuff in your stylesheet.

The way around this is to cut out the whitespace that you don't want using xsl:text elements to limit what whitespace is interpreted as whitespace for inclusion in the result tree, and what is just used to indent your stylesheet code.

If you use the template:

<xsl:template match="B" mode="special">
   <xsl:text> &lt;B&gt; </xsl:text>
   <xsl:value-of select="." />
   <xsl:text> &lt;/B&gt;</xsl:text>
</xsl:template>

Then the node tree that the DOM builder builds looks like:

+- (element) xsl:template
   +- (text) [NL][SP][SP]
   +- (element) xsl:text
   |  +- (text) [SP]<B>[SP]
   +- (text) [NL][SP][SP]
   +- (element) xsl:value-of
   +- (text) [NL][SP][SP]
   +- (element) xsl:text
   |  +- (text) [SP]</B>
   +- (text) [NL]

This time, a lot of whitespace disappears when the whitespace-only nodes are cut out by the XSLT processor:

+- (element) xsl:template
   +- (element) xsl:text
   |  +- (text) [SP]<B>[SP]
   +- (element) xsl:value-of
   +- (element) xsl:text
      +- (text) [SP]</B>

And the output is simply:

+- (text) [SP]<B>[SP]x[SP]</B>

which I think is what you're after.
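
The stripping rule can be simulated in a few lines of Python. This is a toy model, not how any real XSLT processor is implemented: templates are represented as hand-built tuples that mirror the node trees drawn above, and the names strip_whitespace_only and instantiate are invented for the example.

```python
def strip_whitespace_only(children):
    """Drop whitespace-only text nodes, except inside xsl:text elements."""
    kept = []
    for node in children:
        if node[0] == "text":
            if node[1].strip(" \t\r\n") == "":
                continue  # whitespace-only text node: stripped
            kept.append(node)
        else:
            _, name, kids = node
            if name != "xsl:text":
                kids = strip_whitespace_only(kids)
            kept.append(("element", name, kids))
    return kept


def instantiate(children, value):
    """Build the result text: copy text nodes, expand xsl:text and xsl:value-of."""
    out = []
    for node in children:
        if node[0] == "text":
            out.append(node[1])
        elif node[1] == "xsl:text":
            out.append("".join(child[1] for child in node[2]))
        elif node[1] == "xsl:value-of":
            out.append(value)
    return "".join(out)


# First template: literal text mixed with stylesheet indentation.
FIRST_TEMPLATE = [
    ("text", "\n  <B>\n  "),
    ("element", "xsl:value-of", []),
    ("text", "\n  </B>"),
    ("element", "xsl:text", [("text", " ")]),
    ("text", "\n"),
]

# Second template: all wanted text wrapped in xsl:text.
SECOND_TEMPLATE = [
    ("text", "\n  "),
    ("element", "xsl:text", [("text", " <B> ")]),
    ("text", "\n  "),
    ("element", "xsl:value-of", []),
    ("text", "\n  "),
    ("element", "xsl:text", [("text", " </B>")]),
    ("text", "\n"),
]

first_result = instantiate(strip_whitespace_only(FIRST_TEMPLATE), "x")
second_result = instantiate(strip_whitespace_only(SECOND_TEMPLATE), "x")
```

In the first template only the trailing newline is removed, because the other text nodes contain non-whitespace characters; in the second, every indentation node is whitespace-only and disappears.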

5.

Line split and empty content

Jeni Tennison

A questioner asked: With XML as follows:

<organisations>
   <orgRecord>
      <orgID>1</orgID>
      <organisation>World Health</organisation>
      <street>3 street </street>
      <city>Liverpool</city>
      <state></state>
      <postalCode>L42 GH</postalCode>
   </orgRecord>
</organisations>

I need to concatenate the address fields:

<xsl:value-of select="concat(street, ', ', city, ', ', state, ', ',
postalCode)"/>

1. I would prefer the output to look like this

street,
city,
state,
postalCode

2. Is there a way to check whether "state" is empty, i.e. <state></state> (as in the XML), and then return:

street,
city,
postalCode

Jeni answers:

I guess that you mean you want a line break rather than just a space after each comma. You might try writing the line breaks in literally:

<xsl:value-of select="concat(street, ',
', city, ',
', state, ',
', postalCode)" />

but an XML parser normalizes a literal line break inside an attribute value to a space before the XSLT processor ever sees it, so this gives the same result as ', '. Use a character reference instead:

<xsl:value-of select="concat(street, ',&#xA;', city, ',&#xA;',
                             state, ',&#xA;', postalCode)" />

A character reference is not normalized away, so the XSLT processor receives a real line feed in the string.

> 2. Is there a way to check whether "state" is empty, i.e.
> <state></state> (as in the XML), and then return:

So you mean that the value of state is an empty string rather than that the element state doesn't exist? You can check whether the value of state is an empty string with:

  state = ''

or:

  not(string(state))

To get the output that you want, you need to split up the xsl:value-of and insert an xsl:if:

  <xsl:value-of select="concat(street, ',&#xA;', city, ',&#xA;')" />
  <xsl:if test="string(state)">
     <xsl:value-of select="concat(state, ',&#xA;')" />
  </xsl:if>
  <xsl:value-of select="postalCode" />
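
The same logic, sketched in Python for readers who find it easier to follow outside XSLT. Note one deliberate difference, which is an assumption of this sketch rather than part of Jeni's answer: it skips any empty field, not just state. The function name format_address is invented for the example.

```python
def format_address(*fields):
    """Join the non-empty fields with a comma and a line feed between them."""
    return ",\n".join(field for field in fields if field)


address = format_address("3 street", "Liverpool", "", "L42 GH")
```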

6.

Indentation with whitespaces

Jeni Tennison



> Since the template doesn't know in which level of the hierarchy it
> is, it can't indent the output correctly.
>
> How would you do this? Is there a way to recursively traverse the
> nodes? So when I'm at a group node, it would call a template to
> check again for group and field and branch off from there
> recursively?

XSLT is designed to work in this kind of recursive way. In your case, as you describe, you would have two templates: one matching group elements and one matching field elements, and the different templates would do different things.

This is a very neat way of doing it if you're producing HTML and using CSS to produce the indentation that you want. You can use div elements in HTML to add indentation at each level (or if you want a quick-and-dirty way, use blockquote elements):

<xsl:template match="group">
   <div class="indent">
      GROUP-LEVEL
      <xsl:apply-templates />
   </div>
</xsl:template>

<xsl:template match="field">
   FIELD-LEVEL
</xsl:template>

If you're producing text, on the other hand, then the easiest way to achieve the indentation is to pass down an 'indent' parameter from template to template. You can make a template accept a parameter with an xsl:param element. When you apply templates to the child elements of the group element, you need to pass on the parameter with the xsl:with-param element. At each group element, add to the indent that you pass down by concatenating some more spaces to it:

<xsl:template match="group">
   <xsl:param name="indent" />
   <xsl:value-of select="$indent" />
   <xsl:text>GROUP-LEVEL&#xA;</xsl:text>
   <xsl:apply-templates>
      <xsl:with-param name="indent" select="concat($indent, '   ')" />
   </xsl:apply-templates>
</xsl:template>

<xsl:template match="field">
   <xsl:param name="indent" />
   <xsl:value-of select="$indent" />
   <xsl:text>FIELD-LEVEL&#xA;</xsl:text>
</xsl:template>
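
The parameter-passing pattern is ordinary recursion with an accumulating argument. Here is a rough Python equivalent using the standard xml.etree.ElementTree module; the sample document and the render function are assumptions made for the illustration, and elements other than group and field are simply ignored.

```python
import xml.etree.ElementTree as ET


def render(element, indent=""):
    """Return output lines for an element, indenting one step per group level."""
    lines = []
    if element.tag == "group":
        lines.append(indent + "GROUP-LEVEL")
        for child in element:
            # Pass a longer indent down, like xsl:with-param does.
            lines.extend(render(child, indent + "   "))
    elif element.tag == "field":
        lines.append(indent + "FIELD-LEVEL")
    return lines


SAMPLE = "<group><field/><group><field/><field/></group></group>"
indented = render(ET.fromstring(SAMPLE))
```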

If you don't like this approach, you could continue to apply templates to all the descendants and then work out the depth of the element by counting how many ancestors it has with:

  count(ancestor::*)

and use that as the basis for an indentation or whatever. However, this is a lot more fiddly and it's likely to be much less efficient than going with the XSLT flow as described above.
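
To see why the ancestor-counting approach costs more, consider this Python sketch. ElementTree keeps no parent pointers, so the sketch builds a child-to-parent map first and then walks upwards for every element, which is roughly the repeated work that count(ancestor::*) implies. The function name and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET


def render_by_depth(root, unit="   "):
    """Indent each element name by its number of element ancestors."""
    parent = {child: node for node in root.iter() for child in node}
    lines = []
    for element in root.iter():  # document order
        depth = 0
        node = element
        while node in parent:  # climb to the document element, counting steps
            node = parent[node]
            depth += 1
        lines.append(unit * depth + element.tag)
    return lines


by_depth = render_by_depth(
    ET.fromstring("<group><field/><group><field/></group></group>")
)
```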

7.

Line split at 64 characters in plain text output

Dimitre Novatchev.




> I'm searching for a template that will let me break a long line 
> into several shorter ones.  Specifically,

> * The output is for plain text, not HTML.
> * Each new line must not exceed a parameterized
> breakpoint (e.g., 64 characters)

> Any suggestions?  Thanks.

My first attempt was to find the number of 64-byte chunks and to produce them with simple iteration (using my "buildListWhile" template). While this was very simple to implement, the result is not very appealing, because words at the line ends get split in the middle.

An intelligent solution will not split a word if it overlaps the line's ending position. Instead, this word will become the first word of the next line.

Below I'm presenting such a solution. I'm re-using my functional tokenizer, published last month on the list.

The idea is to parse the text and obtain a result structured like the following:

<line><word>My</word><word>first</word><word>attempt</word><word>was</word></line>
<line><word>to</word><word>find</word><word>the</word><word>number</word></line>
<line><word>of</word><word>64</word><word>byte</word><word>chunks</word></line>

where the string() of any "line" is the maximum that would fit the given line-length if words are not split apart.

I'm using once again the str-foldl template, which on every character instantiates the template matching "str-split2lines-func:*".

This template recognises every new word and accumulates the result, which is a list of "line" elements, each having a list of "word" children. After the last "line" element there's a single "word", in which the "current word" is being accumulated.

Whenever the current character is one of the specified delimiters, this signals the formation of a new word. This word is either added to the last line (if the total line length will not exceed the specified line-length), or a new line is started and this word becomes the first in the new line.

There are two possible improvements, which are left as an exercise to you:

1. If a single word exceeds the specified line-length, then it must be split apart.

2. Lines could be justified both to the left and to the right.

And here's the code:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:str-split2lines-func="f:str-split2lines-func"
exclude-result-prefixes="xsl msxsl str-split2lines-func"
>

   <xsl:import href="str-foldl.xsl"/>

   <str-split2lines-func:str-split2lines-func/>

   <xsl:output indent="yes" omit-xml-declaration="yes"/>
   
    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      
      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>
                    
      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
	      <xsl:call-template name="str-foldl">
	        <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
	        <xsl:with-param name="pStr" select="$pStr"/>
	        <xsl:with-param 
              name="pA0" select="msxsl:node-set($vrtfParams)"/>
	      </xsl:call-template>
      </xsl:variable>
      
      <xsl:for-each select="msxsl:node-set($vResult)/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>
         
      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>
      
	  <xsl:choose>
	    <xsl:when test="contains($arg1/*[1], $arg2)">
	      <xsl:if test="string($arg1/word)">
	         <xsl:call-template name="fillLine">
	           <xsl:with-param 
             name="pLine" select="$arg1/line[last()]"/>
	           <xsl:with-param name="pWord" select="$arg1/word"/>
	           <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
	         </xsl:call-template>
	      </xsl:if>
	    </xsl:when>
	    <xsl:otherwise>
	      <xsl:copy-of select="$arg1/line[last()]"/>
	      <word><xsl:value-of 
               select="concat($arg1/word, $arg2)"/></word>
	    </xsl:otherwise>
	  </xsl:choose>
	</xsl:template>
      
      <!-- Test if the new word fits into the last line -->
	<xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />
      
      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) 
                                             + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + string-length($pWord) >
             $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

When instantiated as follows:

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" 
         select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

and the source xml document is:

<text>
Dec. 13 - As always for a presidential inaugural, security
and surveillance were extremely tight in Washington, DC,
last January. But as George W. Bush prepared to take the
oath of office, security planners installed an extra layer
of protection: a prototype software system to detect a
biological attack. The U.S. Department of Defense, together
with regional health and emergency-planning agencies,
distributed a special patient-query sheet to military
clinics, civilian hospitals and even aid stations along the
parade route and at the inaugural balls. Software quickly
analyzed complaints of seven key symptoms - from rashes to
sore throats - for patterns that might indicate the early
stages of a bio-attack. There was a brief scare: the system
noticed a surge in flulike symptoms at military clinics.
Thankfully, tests confirmed it was just that - the
flu.</text>

with the text occupying just one line, the result of the transformation is:

Dec. 13 - As always for a presidential inaugural, security and 
surveillance were extremely tight in Washington, DC, last 
January. But as George W. Bush prepared to take the oath of 
office, security planners installed an extra layer of 
protection: a prototype software system to detect a biological 
attack. The U.S. Department of Defense, together with regional 
health and emergency-planning agencies, distributed a special 
patient-query sheet to military clinics, civilian hospitals and 
even aid stations along the parade route and at the inaugural 
balls. Software quickly analyzed complaints of seven key 
symptoms - from rashes to sore throats - for patterns that might 
indicate the early stages of a bio-attack. There was a brief 
scare: the system noticed a surge in flulike symptoms at 
military clinics. Thankfully, tests confirmed it was just that - 
the flu. 

As can be seen, the largest line-length is 64 -- as specified.
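
The word-fitting rule itself is easier to see outside XSLT. The following Python sketch applies the same greedy rule (add the word to the current line if it fits, otherwise start a new line with it); it is not a translation of the FXSL stylesheet, the function name is invented, and unlike the stylesheet it does not emit a trailing space after the last word of each line. As in the original, a single word longer than the line length is left unsplit.

```python
def split_to_lines(text, line_length=64, delimiters=" \t\n\r"):
    """Greedily pack words into lines of at most line_length characters."""
    # Pass 1: tokenise on the delimiter characters, skipping empty words.
    words, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))

    # Pass 2: add each word to the current line if it still fits.
    lines, line, used = [], [], 0
    for word in words:
        needed = len(word) if not line else used + 1 + len(word)
        if line and needed > line_length:
            lines.append(" ".join(line))
            line, used = [word], len(word)
        else:
            line.append(word)
            used = needed
    if line:
        lines.append(" ".join(line))
    return lines
```

For example, split_to_lines("aa bb cc", 5) gives ["aa bb", "cc"].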

8.

text output method, and newline problems.

Tom Passin



> What I've read so far for plain text output is unsatisfying. I have two
> issues.
>
> 1. Preventing initial first lines.
>
> This template will print a blank first line.
>
> <xsl:template>
> PDS_VERSION_ID          = PDS3
> RECORD_TYPE             = FIXED_LENGTH
> </xsl:template>
>

When all else fails, the minimally intrusive thing I know to do is this -

<xsl:template
>PDS_VERSION_ID    =  PDS3
RECORD_TYPE             =  FIXED_LENGTH
</xsl:template>

>
> 2. Inserting a carriage return.
>
> Similarly, putting in a bunch of
> <xsl:text>&#13;</xsl:text>
> elements disrupts the stylesheet.
>
> Is there any easier way to do it?
>

The reason this (and your earlier <xsl:text/> method) works is not the use of the xsl:text element per se, but rather the fact that there are two tags in a row with only whitespace between them. The processor has to decide whether that whitespace is part of real character data or is just there for visual formatting and can therefore be ignored. The standard thing for the processor to do is to ignore such whitespace-only nodes.

Therefore, in example 1, any element will do, and if you are using the text output method, you could use <a/> instead of <xsl:text/>. Your text starts after the <a/> element, and there is no line feed there. The line feed between the <xsl:template> and the <a/> is ignored, and you get what you want.

As for new lines, they will be copied to the output as long as they are part of a chunk of character data. If you want them between, for example, two consecutive xsl:value-of elements, similar considerations apply. Whitespace between elements gets ignored. <xsl:text> is one way to put non-whitespace text there. A shorter way is to put a non-breaking space in place, like this:

<xsl:value-of select='"xxx"'/>&#160;
<xsl:value-of select='"yyy"'/>

Now the two xsl:value-of elements are not separated by whitespace alone, and so the new line is honored. In a decent editor or other viewer, the non-breaking space will display properly, but if the viewer expects one encoding and your output is in another, it may display as a strange character. But it is easier than <xsl:text>&#13;</xsl:text>. Any character will do, like a dash or period, but you do not always want something visible.

You can also try to define an entity to use instead of the non-breaking space, like putting this into your stylesheet:

<!DOCTYPE xsl:stylesheet[
 <!ENTITY CR "<xsl:text>&#13;</xsl:text>">
]>

<xsl:value-of select='"xxx"'/>&CR;
<xsl:value-of select='"yyy"'/>

This works with Saxon, XT, Xalan, and Sablotron, but I can never get it to work with MSXML, which complains about missing namespaces when I run it in XML Cooktop.

9.

Line breaking stylesheet for plain text

Dimitre Novatchev.



> I'm searching for a template that will let me break 
> a long line into several  shorter ones.  Specifically,

> * The output is for plain text, not HTML.
> * Each new line not exceed a parameterized breakpoint 
>  (e.g., 64 characters)

My first attempt was to find the number of 64-byte chunks and to produce them with simple iteration (using my "buildListWhile" template). While this was very simple to implement, the result is not very appealing, because words at the line ends get split in the middle.

An intelligent solution will not split a word if it overlaps the line's ending position. Instead, this word will become the first word of the next line.

Below I'm presenting such a solution. I'm re-using my functional tokenizer, published last month: aspn.activestate.com

The idea is to parse the text and obtain a result structured like the following:

<line><word>My</word><word>first</word><word>attempt</word><word>was</word></line>
<line><word>to</word><word>find</word><word>the</word><word>number</word></line>
<line><word>of</word><word>64</word><word>byte</word><word>chunks</word></line>

where the string() of any "line" is the maximum that would fit the given line-length if words are not split apart.

I'm using once again the str-foldl template, which on every character instantiates the template matching "str-split2lines-func:*".

This template recognises every new word and accumulates the result, which is a list of "line" elements, each having a list of "word" children. After the last "line" element there's a single "word", in which the "current word" is being accumulated.

Whenever the current character is one of the specified delimiters, this signals the formation of a new word. This word is either added to the last line (if the total line length will not exceed the specified line-length), or a new line is started and this word becomes the first in the new line.

There are two possible improvements, which are left as an exercise to you:

1. If a single word exceeds the specified line-length, then it must be split apart.

2. Lines could be justified both to the left and to the right.

And here's the code:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:str-split2lines-func="f:str-split2lines-func"
exclude-result-prefixes="xsl msxsl str-split2lines-func"
>

   <xsl:import href="str-foldl.xsl"/>

   <str-split2lines-func:str-split2lines-func/>

   <xsl:output indent="yes" omit-xml-declaration="yes"/>
   
    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      
      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>
                    
      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
	      <xsl:call-template name="str-foldl">
	        <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
	        <xsl:with-param name="pStr" select="$pStr"/>
	        <xsl:with-param name="pA0"
	  select="msxsl:node-set($vrtfParams)"/>
	      </xsl:call-template>
      </xsl:variable>
      
      <xsl:for-each select="msxsl:node-set($vResult)/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>
         
      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>
      
	  <xsl:choose>
	    <xsl:when test="contains($arg1/*[1], $arg2)">
	      <xsl:if test="string($arg1/word)">
	         <xsl:call-template name="fillLine">
	           <xsl:with-param name="pLine"
	  select="$arg1/line[last()]"/>
	           <xsl:with-param name="pWord" select="$arg1/word"/>
	           <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
	         </xsl:call-template>
	      </xsl:if>
	    </xsl:when>
	    <xsl:otherwise>
	      <xsl:copy-of select="$arg1/line[last()]"/>
	      <word><xsl:value-of 
	  select="concat($arg1/word,
	  $arg2)"/></word>
	    </xsl:otherwise>
	  </xsl:choose>
	</xsl:template>
      
      <!-- Test if the new word fits into the last line -->
	<xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />
      
      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) 
                                             + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + 
	  string-length($pWord) > $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

When instantiated as follows:

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters" 
	  select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

and the source xml document is:

<text>
Dec. 13 - As always for a presidential inaugural, security and surveillance were
extremely tight in Washington, DC, last January. But as George W. Bush prepared to
take the oath of office, security planners installed an extra layer of protection: a
prototype software system to detect a biological attack. The U.S. Department of
Defense, together with regional health and emergency-planning agencies, distributed
a special patient-query sheet to military clinics, civilian hospitals and even aid
stations along the parade route and at the inaugural balls. Software quickly
analyzed complaints of seven key symptoms - from rashes to sore throats - for
patterns that might indicate the early stages of a bio-attack. There was a brief
scare: the system noticed a surge in flulike symptoms at military clinics.
Thankfully, tests confirmed it was just that - the flu.</text>

with the text occupying just one line, the result of the transformation is:

Dec. 13 - As always for a presidential inaugural, security and 
surveillance were extremely tight in Washington, DC, last 
January. But as George W. Bush prepared to take the oath of 
office, security planners installed an extra layer of 
protection: a prototype software system to detect a biological 
attack. The U.S. Department of Defense, together with regional 
health and emergency-planning agencies, distributed a special 
patient-query sheet to military clinics, civilian hospitals and 
even aid stations along the parade route and at the inaugural 
balls. Software quickly analyzed complaints of seven key 
symptoms - from rashes to sore throats - for patterns that might 
indicate the early stages of a bio-attack. There was a brief 
scare: the system noticed a surge in flulike symptoms at 
military clinics. Thankfully, tests confirmed it was just that - 
the flu. 

As can be seen, the largest line-length is 64 -- as specified.

10.

Splitting Camel Case strings.

Dimitre Novatchev



> It's often the habit to camel-case strings in IDs and 
> such. 

> Let's suppose I want to split the string at the upper case letters
> Anybody have any thoughts on how to do this?

It seems to me that you want to preserve the capital letters? If *not*, then the following is a most straightforward solution using the "str-split-to-words" template of FXSL:

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>

   <xsl:import href="strSplit-to-Words.xsl"/>
<!-- This transformation must be applied to:
        testSplitToWords4.xml               
-->

   <xsl:output indent="yes" omit-xml-declaration="yes"/>

   <xsl:variable name="vCaps" 
    select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
    
    <xsl:template match="/">
      <xsl:call-template name="str-split-to-words">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pDelimiters" 
                        select="$vCaps"/>
      </xsl:call-template>
    </xsl:template>
</xsl:stylesheet>

when applied against this source.xml:

<t>thisIsACamelCasedWord</t>

Produces:

<word>this</word>
<word>s</word>
<word>amel</word>
<word>ased</word>
<word>ord</word>

In case you need to preserve the capital letters, the solution is slightly different. One first pass is made on the string, which inserts a space in front of every capital letter. The newly produced string is then tokenised. In the first pass I also use the "str-map" template from FXSL.

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:myMark="f:MarkAnUppercase" 
 exclude-result-prefixes="myMark"
>

   <xsl:import href="str-map.xsl"/>
   <xsl:import href="strSplit-to-Words.xsl"/>
<!-- This transformation must be applied to:
        testSplitToWords4.xml               
-->

   <xsl:output indent="yes" omit-xml-declaration="yes"/>

   <xsl:variable name="vCaps" 
    select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>
    
    <myMark:myMark/>
    <xsl:template match="myMark:*">
      <xsl:param name="arg1"/>
      
      <xsl:if test="contains($vCaps, $arg1)">
        <xsl:text> </xsl:text>
      </xsl:if>
      <xsl:value-of select="$arg1"/>
    </xsl:template>
    
    <xsl:template match="/">
    
      <xsl:variable name="vSpaceDelimited">
        <xsl:call-template name="str-map">
          <xsl:with-param name="pFun" 
            select="document('')/*/myMark:*[1]"/>
          <xsl:with-param name="pStr" select="/*"/>
        </xsl:call-template>
      </xsl:variable>
      
      <xsl:call-template name="str-split-to-words">
        <xsl:with-param name="pStr" select="$vSpaceDelimited"/>
        <xsl:with-param name="pDelimiters" 
                        select="' '"/>
      </xsl:call-template>
    </xsl:template>
</xsl:stylesheet>

when applied against the same source.xml produces:

<word>this</word>
<word>Is</word>
<word>A</word>
<word>Camel</word>
<word>Cased</word>
<word>Word</word>
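
For readers without FXSL at hand, the first pass described above (a space in front of every capital letter) can also be sketched as a plain recursive named template in XSLT 1.0. This is only an illustrative sketch and not part of FXSL: the template name "space-before-caps" is made up here, and the stylesheet writes the space-delimited string as text instead of producing <word> elements.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <xsl:variable name="vCaps"
   select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

  <!-- Hypothetical helper: copies pStr one character at a time,
       writing a space in front of every capital letter -->
  <xsl:template name="space-before-caps">
    <xsl:param name="pStr"/>

    <xsl:if test="string-length($pStr) &gt; 0">
      <xsl:variable name="vFirst" select="substring($pStr, 1, 1)"/>

      <xsl:if test="contains($vCaps, $vFirst)">
        <xsl:text> </xsl:text>
      </xsl:if>
      <xsl:value-of select="$vFirst"/>

      <!-- Recurse on the rest of the string -->
      <xsl:call-template name="space-before-caps">
        <xsl:with-param name="pStr" select="substring($pStr, 2)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:template match="/">
    <xsl:call-template name="space-before-caps">
      <xsl:with-param name="pStr" select="string(/*)"/>
    </xsl:call-template>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the same source.xml this should give the text "this Is A Camel Cased Word". Note that a string starting with a capital letter gets a leading space, and that the recursion goes one level deep per character, which is fine for IDs but not for very long strings.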

11.

Line breaking stylesheet for plain text

Dimitre Novatchev



> I'm searching for a template that will let me break a long line
> into several shorter ones.
> Specifically,
> * The output is for plain text, not HTML.
> * Each new line not exceed a parameterized breakpoint (e.g., 64
> characters)

My first attempt was to find the number of 64 byte chunks and to produce them with simple iteration (using my "buildListWhile" template). While this was very simple to implement, the result is not very appealing, because words at the end of a line get split in the middle.

A more intelligent solution does not split a word that straddles the end-of-line position. Instead, that word becomes the first word of the next line.

Below I present such a solution. It re-uses my functional tokenizer, published last month on activestate.com.

The idea is to parse the text and obtain a result structured like the following:

<line><word>My</word><word>first</word><word>attempt</word><word>was</word></line>
<line><word>to</word><word>find</word><word>the</word><word>number</word></line>
<line><word>of</word><word>64</word><word>byte</word><word>chunks</word></line>

where the string() of any "line" is the maximum that would fit the given line-length if words are not split apart.

I'm using once again the str-foldl template, which on every character instantiates the template matching "str-split2lines-func:*".

This template recognises every new word and accumulates the result, which is a list of "line" elements, each having a list of "word" children. After the last "line" element there's a single "word", in which the "current word" is being accumulated.

Whenever the current character is one of the specified delimiters, this signals the formation of a new word. This word is either added to the last line (if the total line length will not exceed the specified line-length), or a new line is started and this word becomes the first in the new line.

There are two possible improvements, which are left as an exercise to you:

1. If a single word exceeds the specified line-length, then it must be split apart.

2. Lines could be justified both to the left and to the right.

And here's the code:

<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:msxsl="urn:schemas-microsoft-com:xslt"
xmlns:str-split2lines-func="f:str-split2lines-func"
exclude-result-prefixes="xsl msxsl str-split2lines-func"
>

   <xsl:import href="str-foldl.xsl"/>

   <str-split2lines-func:str-split2lines-func/>

   <xsl:output indent="yes" omit-xml-declaration="yes"/>
   
    <xsl:template name="str-split-to-lines">
      <xsl:param name="pStr"/>
      <xsl:param name="pLineLength" select="60"/>
      <xsl:param name="pDelimiters" select="' &#9;&#10;&#13;'"/>
      
      <xsl:variable name="vsplit2linesFun"
                    select="document('')/*/str-split2lines-func:*[1]"/>
                    
      <xsl:variable name="vrtfParams">
       <delimiters><xsl:value-of select="$pDelimiters"/></delimiters>
       <lineLength><xsl:copy-of select="$pLineLength"/></lineLength>
      </xsl:variable>

      <xsl:variable name="vResult">
        <xsl:call-template name="str-foldl">
          <xsl:with-param name="pFunc" select="$vsplit2linesFun"/>
          <xsl:with-param name="pStr" select="$pStr"/>
          <xsl:with-param name="pA0" select="msxsl:node-set($vrtfParams)"/>
        </xsl:call-template>
      </xsl:variable>
      
      <xsl:for-each select="msxsl:node-set($vResult)/line">
        <xsl:for-each select="word">
          <xsl:value-of select="concat(., ' ')"/>
        </xsl:for-each>
        <xsl:value-of select="'&#10;'"/>
      </xsl:for-each>
    </xsl:template>

    <xsl:template match="str-split2lines-func:*">
      <xsl:param name="arg1" select="/.."/>
      <xsl:param name="arg2"/>
         
      <xsl:copy-of select="$arg1/*[position() &lt; 3]"/>
      <xsl:copy-of select="$arg1/line[position() != last()]"/>
      
      <xsl:choose>
        <xsl:when test="contains($arg1/*[1], $arg2)">
          <xsl:choose>
            <xsl:when test="string($arg1/word)">
              <xsl:call-template name="fillLine">
                <xsl:with-param name="pLine" select="$arg1/line[last()]"/>
                <xsl:with-param name="pWord" select="$arg1/word"/>
                <xsl:with-param name="pLineLength" select="$arg1/*[2]"/>
              </xsl:call-template>
            </xsl:when>
            <!-- No pending word (e.g. two delimiters in a row):
                 keep the last line, which would otherwise be dropped -->
            <xsl:otherwise>
              <xsl:copy-of select="$arg1/line[last()]"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$arg1/line[last()]"/>
          <word><xsl:value-of select="concat($arg1/word, $arg2)"/></word>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>
      
    <!-- Test if the new word fits into the last line -->
    <xsl:template name="fillLine">
      <xsl:param name="pLine" select="/.."/>
      <xsl:param name="pWord" select="/.."/>
      <xsl:param name="pLineLength" />
      
      <xsl:variable name="vnWordsInLine" select="count($pLine/word)"/>
      <xsl:variable name="vLineLength" select="string-length($pLine) 
                                             + $vnWordsInLine"/>
      <xsl:choose>
        <xsl:when test="not($vLineLength + string-length($pWord) > $pLineLength)">
          <line>
            <xsl:copy-of select="$pLine/*"/>
            <xsl:copy-of select="$pWord"/>
          </line>
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="$pLine"/>
          <line>
            <xsl:copy-of select="$pWord"/>
          </line>
          <word/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:template>

</xsl:stylesheet>

When instantiated as follows:

    <xsl:template match="/">
      <xsl:call-template name="str-split-to-lines">
        <xsl:with-param name="pStr" select="/*"/>
        <xsl:with-param name="pLineLength" select="64"/>
        <xsl:with-param name="pDelimiters"
                        select="' &#9;&#10;&#13;'"/>
      </xsl:call-template>
    </xsl:template>

and the source xml document is:

Ednote: Please note the following should not be line broken; laid out better for browser viewing :-)

<text> Dec. 13 - As always for a
presidential inaugural, security and surveillance were
extremely tight in Washington, DC, last January. But as
George W. Bush prepared to take the oath of office, security
planners installed an extra layer of protection: a prototype
software system to detect a biological attack. The
U.S. Department of Defense, together with regional health
and emergency-planning agencies, distributed a special
patient-query sheet to military clinics, civilian hospitals
and even aid stations along the parade route and at the
inaugural balls. Software quickly analyzed complaints of
seven key symptoms - from rashes to sore throats - for
patterns that might indicate the early stages of a
bio-attack. There was a brief scare: the system noticed a
surge in flulike symptoms at military clinics.  Thankfully,
tests confirmed it was just that - the flu.</text>

with the text occupying just one line, the result of the transformation is:

Dec. 13 - As always for a presidential inaugural, security and
surveillance were extremely tight in Washington, DC, last
January. But as George W. Bush prepared to take the oath of
office, security planners installed an extra layer of
protection: a prototype software system to detect a biological
attack. The U.S. Department of Defense, together with regional
health and emergency-planning agencies, distributed a special
patient-query sheet to military clinics, civilian hospitals and
even aid stations along the parade route and at the inaugural
balls. Software quickly analyzed complaints of seven key
symptoms - from rashes to sore throats - for patterns that might
indicate the early stages of a bio-attack. There was a brief
scare: the system noticed a surge in flulike symptoms at
military clinics. Thankfully, tests confirmed it was just that -
the flu.

As can be seen, the largest line-length is 64 -- as specified.
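
The same greedy rule (a word that does not fit becomes the first word of the next line) can also be sketched without FXSL, as a plain recursive template in XSLT 1.0. This is only a sketch under stated assumptions and not the stylesheet above: the template name "wrap-words" and the parameter "pCol" are made up for illustration, the text is first passed through normalize-space(), and, like the original, it does not split a word that is longer than the line length.

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical template: greedy word wrap by plain recursion.
       pText must be single-space-separated words ending in a space. -->
  <xsl:template name="wrap-words">
    <xsl:param name="pText"/>
    <xsl:param name="pWidth" select="64"/>
    <!-- Number of characters already written on the current line -->
    <xsl:param name="pCol" select="0"/>

    <xsl:if test="contains($pText, ' ')">
      <xsl:variable name="vWord" select="substring-before($pText, ' ')"/>
      <xsl:variable name="vLen" select="string-length($vWord)"/>
      <!-- True when the word does not fit after what is on the line -->
      <xsl:variable name="vBreak"
       select="$pCol &gt; 0 and $pCol + 1 + $vLen &gt; $pWidth"/>

      <!-- Separator: newline on a break, space inside a line,
           nothing in front of the very first word -->
      <xsl:choose>
        <xsl:when test="$vBreak"><xsl:text>&#10;</xsl:text></xsl:when>
        <xsl:when test="$pCol &gt; 0"><xsl:text> </xsl:text></xsl:when>
      </xsl:choose>
      <xsl:value-of select="$vWord"/>

      <xsl:variable name="vNewCol">
        <xsl:choose>
          <xsl:when test="$vBreak or $pCol = 0">
            <xsl:value-of select="$vLen"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="$pCol + 1 + $vLen"/>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:variable>

      <!-- Recurse on the remaining words -->
      <xsl:call-template name="wrap-words">
        <xsl:with-param name="pText" select="substring-after($pText, ' ')"/>
        <xsl:with-param name="pWidth" select="$pWidth"/>
        <xsl:with-param name="pCol" select="number($vNewCol)"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <xsl:template match="/">
    <xsl:call-template name="wrap-words">
      <xsl:with-param name="pText"
       select="concat(normalize-space(/*), ' ')"/>
    </xsl:call-template>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

The trade-off is that this version recurses one level per word and only writes text, whereas the str-foldl version builds the intermediate line/word structure, which is easier to extend (for example, to justify lines).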

12.

DocBook to plain text - what do you use?

Wendell Piez




>I agree that it seems like it should be much easier.  That's one reason 
>I'm puzzled that such a thing doesn't seem to exist.

If you think about, for example, the way interpolations of inline pseudo-markup (like *this* for emphasis) and similar constructs will affect, for example, line wrapping, particularly since

   Some blocks need to get indented like this: it is several lines
   long, and is required to *wrap* nicely, no matter what might turn
   up in it -- requiring the smart introduction of whitespace both at
   line ends and at line starts (and maybe the extent of the indent
   varies as well) --

then it is apparent that creating "pretty plain text" is not as trivial as it may first appear.

My guess is that the graceful XSLT-only solution will require two or three passes over the data.

Another sad fact of life is that one person's pretty plain text is another's ugly stepsister.


>Is it just that no one is interested in producing plain text?  (For 
>example, to produce README files and such from a distribution's general 
>DocBook documentation sources?)  Or is the need little enough that lynx 
>-dump is good enough for people's purposes?

It seems to be one of those problems that is *nearly* general enough for a generic solution, but that has hidden gotchas and local particularities that have hindered the development of a one-size-fits-all solution.

Here's an article about an approach that uses Java (SAX) for the final stage of production of the plain text: ibm.com. So it's not that this problem hasn't come up before. (Not too long ago the list even discussed producing plain-text tables from XML -- a real beast.)