Flat file transformation

1. How to convert flat data to hierarchical data
2. Positional Grouping, or Flat XML: Problem getting children
3. How to transform flat structure into hierarchical one
4. Convert a flat XML document
5. Flat structure to hierarchy
6. How to get structure out of a flat document
7. Nested sections from flat structure
8. Flat file to structured, attribute based
9. Styled paragraphs to structured XML
10. Text before end marker or applying structure
11. Making flat files structured hierarchically
12. Grouping flat element structures to make lists

1.

How to convert flat data to hierarchical data

Michael Kay

The trick is to write a stylesheet with template rules that use <xsl:apply-templates> to process the children of the current element in the usual way, but instead of selecting the physical children of an element, you select its logical children, based on following the relationships in your data. The only other thing needed is to start processing at the logical root of the tree.
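
As a minimal sketch of this idea, assume a hypothetical flat input in which every item element is a child of an items element, carries an id attribute and, unless it is a logical root, a parent attribute naming its logical parent. None of these names come from the original question; they are only there to make the pattern concrete.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index each item by the id of its logical parent (assumed attribute) -->
  <xsl:key name="children" match="item" use="@parent"/>

  <xsl:template match="/items">
    <tree>
      <!-- Start at the logical roots: items with no parent attribute -->
      <xsl:apply-templates select="item[not(@parent)]"/>
    </tree>
  </xsl:template>

  <xsl:template match="item">
    <node id="{@id}">
      <!-- Select the logical children rather than the physical ones -->
      <xsl:apply-templates select="key('children', @id)"/>
    </node>
  </xsl:template>

</xsl:stylesheet>
```

The recursion stops by itself when key('children', @id) returns an empty node-set.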

2.

Positional Grouping, or Flat XML: Problem getting children

Michael Kay

There are broadly three approaches:

(a) in XSLT 2.0, use xsl:for-each-group group-starting-with=pattern

(b) handle it in the same way as value-based grouping, i.e. Muenchian grouping using keys, but with a carefully crafted grouping key: in this case something like generate-id((self::NT | preceding-sibling::NT)[last()]), which gives each NT its own id and every other record the id of the nearest preceding NT

(c) recurse over the logical tree using xsl:apply-templates; on each iteration, instead of selecting the physical children, you select the logical children, which are the following siblings that have the current record as their most recent logical parent: something like

<xsl:apply-templates
  select="following-sibling::*[generate-id(preceding-sibling::NT[1]) =
          generate-id(current())]"/>
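
Approach (b) might be sketched as follows. The records wrapper element, the key name 'group' and the groups/group output elements are assumptions made for illustration. The key takes the last node of (self::NT | preceding-sibling::NT), so that an NT is indexed under its own id and every other record under the id of the nearest preceding NT.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index every record by the id of the NT element that starts its group -->
  <xsl:key name="group" match="records/*"
           use="generate-id((self::NT | preceding-sibling::NT)[last()])"/>

  <xsl:template match="records">
    <groups>
      <!-- Each NT starts a group -->
      <xsl:for-each select="NT">
        <group>
          <!-- The NT itself plus the records that follow it, in document order -->
          <xsl:copy-of select="key('group', generate-id())"/>
        </group>
      </xsl:for-each>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

Records that come before the first NT get an empty key value and are not output by this sketch.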

3.

How to transform flat structure into hierarchical one

Jeni Tennison

I get a flat structure as a result of SQL query and want to transform it
into hierarchical structure.

E.g. initial XML could look like:

<record_set>
    <record>
        <house_id>h1</house_id>
        <room_id>r1</room_id>
    </record>
    <record>
        <house_id>h1</house_id>
        <room_id>r2</room_id>
    </record>
    <record>
        <house_id>h2</house_id>
        <room_id>r3</room_id>
    </record>
</record_set>

and the result of transformation could look like:

<house_list>
    <house>
        <id>h1</id>
        <rooms>
            <room>
                <id>r1</id>
            </room>
            <room>
                <id>r2</id>
            </room>
        </rooms>
    </house>
    <house>
        <id>h2</id>
        <rooms>
            <room>
                <id>r3</id>
            </room>
        </rooms>
    </house>
</house_list>


Getting a grouped or hierarchical structure from a flat structure is a very common question, and there are design patterns in the FAQ that will help you to do this. I also agree with Mike and Wendell that adjusting your SQL output would probably be more efficient. However, I'm going to talk you through how to use the structures that are in the FAQ in your particular situation, because I know it's sometimes hard to translate between the two.

There are two things that you need to know about to solve your problem using the Muenchian method: using keys and comparing nodes.

First, you need to group the things that you want to group together. In your case, you want to group the records according to the house that they refer to. You do this by defining a key using xsl:key. The three attributes are:

1. name - this can be anything you want: 'records', for example
2. match - these are the things that you want to group: the 'record' elements in your case
3. use - this is the thing that you want to index the groups by: the value of the 'house_id' element in your case

In other words:

  <xsl:key name="records" match="record" use="house_id" />

This allows you to use the key() function to access the groups that you've made. The first argument gives the name of the key. The second argument gives the value of the thing that you have grouped the groups by. In your case, to find out all the records that have a house_id of 'h1', you can say:

  key('records', 'h1')

Second, you want to step through each of these groups. You do this by stepping through each of the things that are grouped and identifying those that occur first in the groups. In your case, you step through each of the records and find out which ones are the first records about a particular house. Given a particular house_id, you can find the first ones in the groups by taking the first one of the node set that you get from the key:

  key('records', 'h1')[1]

To find out whether a node is the same as the first node in the group that you've defined using the key, you have to compare the nodes. One way of doing this is by comparing the unique id that is generated for the particular node with the unique id that is generated for the first node in the group. In your case, you can get the first record in each group using the pattern:

  record[generate-id() = generate-id(key('records', house_id)[1])]

You want to process only these items, the ones that are first within their particular group. Whether you process them using an xsl:for-each or an xsl:apply-templates is up to you: the select expression for both will be the same.

Finally, for each of these items, you want to cycle through the rest of the items in the group. You use the key() function to get at that group, and then just go through them (again using xsl:for-each or xsl:apply-templates as you desire).

In your case, I've used xsl:for-each because you're not doing anything particularly complicated and because it saves the XSL processor from having to go off hunting for the correct template to apply next. The template is:

<xsl:template match="record_set">
  <house_list>
    <!-- cycle through the first records in each group -->
    <xsl:for-each select="record[generate-id() = generate-id(key('records',
house_id)[1])]">
      <house>
        <id><xsl:value-of select="house_id" /></id>
        <rooms>
          <!-- cycle through each of the records in the group -->
          <xsl:for-each select="key('records', house_id)">
            <room>
              <id><xsl:value-of select="room_id" /></id>
            </room>
          </xsl:for-each>
        </rooms>
      </house>
    </xsl:for-each>
  </house_list>
</xsl:template>

I have tested this within SAXON, using your example, and it gives the desired output.

It is very easy to add sorting to the solution: you use an xsl:sort element within each xsl:for-each, with the select attribute indicating the thing that you want to sort by. So in this case:

<xsl:template match="record_set">
  <house_list>
    <!-- cycle through the first records in each group -->
    <!-- in order based on house_id -->
    <xsl:for-each select="record[generate-id() = generate-id(key('records',
house_id)[1])]">
      <xsl:sort select="house_id" />
      <house>
        <id><xsl:value-of select="house_id" /></id>
        <rooms>
          <!-- cycle through each of the records in the group -->
          <!-- in order based on room_id -->
          <xsl:for-each select="key('records', house_id)">
            <xsl:sort select="room_id" />
            <room>
              <id><xsl:value-of select="room_id" /></id>
            </room>
          </xsl:for-each>
        </rooms>
      </house>
    </xsl:for-each>
  </house_list>
</xsl:template>

The same thing can be done with xsl:apply-templates (by including an xsl:sort within the xsl:apply-templates element), if you were using that instead.
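
A sketch of that xsl:apply-templates variant, assuming the same input and the same 'records' key as above. The mode names 'house' and 'room' are made up here; they only serve to keep the two kinds of processing of a record apart.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:key name="records" match="record" use="house_id" />

  <xsl:template match="record_set">
    <house_list>
      <!-- first record of each group, in order based on house_id -->
      <xsl:apply-templates mode="house"
          select="record[generate-id() =
                         generate-id(key('records', house_id)[1])]">
        <xsl:sort select="house_id" />
      </xsl:apply-templates>
    </house_list>
  </xsl:template>

  <xsl:template match="record" mode="house">
    <house>
      <id><xsl:value-of select="house_id" /></id>
      <rooms>
        <!-- every record in the group, in order based on room_id -->
        <xsl:apply-templates select="key('records', house_id)" mode="room">
          <xsl:sort select="room_id" />
        </xsl:apply-templates>
      </rooms>
    </house>
  </xsl:template>

  <xsl:template match="record" mode="room">
    <room>
      <id><xsl:value-of select="room_id" /></id>
    </room>
  </xsl:template>

</xsl:stylesheet>
```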

4.

Convert a flat XML document

Jeni Tennison

How can I convert this structure:

<firstitem>Item 1</firstitem>
<item>Second Text</item>
<item>blah blah</item>
<item>tee hee</item>
<firstitem>List 2, Item 1</firstitem>
<item>more text</item>
<firstitem>more lists</firstitem>
<item>test</item>

into something like this (indenting is purely decorative):

<list>
	<item>Item 1</item>
	<item>Second Text</item>
	<item>blah blah</item>
	<item>tee hee</item>
</list>
<list>
	<item>List 2, Item 1</item>
	<item>more text</item>
</list>
<list>
	<item>more lists</item>
	<item>test</item>
</list>

> It seems to me that I can process each <firstitem> element and then
> apply templates to all of the following-siblings that have their first
> <firstitem> preceding-sibling equal to the current <firstitem>. But,
> a) I can't seem to get the syntax right for this solution, b) this
> seems like an odd way to solve the problem, and c) somebody probably
> knows an easier way.

There are definitely other approaches (for example, you could use an xsl:for-each rather than applying templates), but your approach is an eminently reasonable one, and the biggest problem, the XPath expression, applies in both, so let's see if we can get the syntax working. I don't know how far you've got with it, so forgive me if I go through everything just in case!

First you want to "process each <firstitem> element". You do that with a template matching that kind of element, so:

<xsl:template match="firstitem">
  ...
</xsl:template>

Second, and importantly, you *only* want to process those firstitem elements. That means that the template that matches whatever element your firstitems and items are children of has to only be applying templates to the firstitem. I'll assume that they're children of an 'items' element:

<xsl:template match="items">
  <xsl:apply-templates select="firstitem" />
</xsl:template>

Third, you want to "apply templates to all of the following-siblings that have their first <firstitem> preceding-sibling equal to the current <firstitem>". Applying templates to a node list is easy:

<xsl:apply-templates select="..." />

The thing that's difficult is generating the node list! Let's take it a step at a time:

1. "all of the following-siblings..."

Actually, you're not interested in *all* of the following-sibling nodes, just in the 'item' elements. The XPath for that is:

  following-sibling::item

In David Carlisle's human-speak, this is "all items that are following siblings".

2. "...that have..."

Here you're narrowing the list you've got by putting an extra condition on it, so you use a predicate:

  following-sibling::item[...]

3. "...their first <firstitem> preceding-sibling..."

or 'the first preceding-sibling that is a firstitem':

  preceding-sibling::firstitem[1]

4. "...the current <firstitem>..."

This could be a little tricky because when we're in the predicate that we're currently in, the *context node* is an item. We want the *current node*, which we know is a firstitem (because that's what this template is matching on) and that we get using:

  current()

5. "...equal to..."

Now we get into a bit of a murky area that I got confused by myself recently. We want to make sure that the *node* that is 'the first preceding-sibling that is a firstitem' is the same as the *node* that is 'the current node'. There are two ways of doing that:

a. generate a unique identifier for the two nodes and compare them:

  generate-id(preceding-sibling::firstitem[1]) = generate-id(current())

b. make a set that is the union of the two nodes and see how many elements are in it - if they're the same node, then the answer is 1:

  count(preceding-sibling::firstitem[1] | current()) = 1

I'm going to use the first because it makes more sense to me. We have our XPath expression:

following-sibling::item[generate-id(preceding-sibling::firstitem[1])
= generate-id(current())]

So we slot that into our xsl:apply-templates, and put that into our firstitem-matching template:

<xsl:template match="firstitem">
  ...
  <xsl:apply-templates
select="following-sibling::item[generate-id(preceding-sibling::firstitem[1])
 = generate-id(current())]" />
  ...
</xsl:template>

What have we forgotten? Oh yes, the output! For each 'firstitem', we want to have a 'list', with the first 'item' having the content of the 'firstitem' and the other 'item's being copies of the original ones.

<xsl:template match="firstitem">
  <list>
    <item><xsl:value-of select="." /></item>
    <xsl:apply-templates
select="following-sibling::item[generate-id(preceding-sibling::firstitem[1])
 = generate-id(current())]" />
  </list>
</xsl:template>

<xsl:template match="item">
  <xsl:copy-of select="." />
</xsl:template>

I've tested the full stylesheet (below) in SAXON and it works.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output indent="yes" />

<xsl:template match="items">
  <xsl:apply-templates select="firstitem" />
</xsl:template>

<xsl:template match="firstitem">
  <list>
    <item><xsl:value-of select="." /></item>
    <xsl:apply-templates
select="following-sibling::item
     [generate-id(preceding-sibling::firstitem[1])
     = generate-id(current())]" />
  </list>
</xsl:template>

<xsl:template match="item">
  <xsl:copy-of select="." />
</xsl:template>

</xsl:stylesheet>
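
For comparison, with an XSLT 2.0 processor the same grouping can probably be written with xsl:for-each-group and group-starting-with, the first approach listed in entry 2. This is only a sketch; it assumes the same 'items' wrapper element as the stylesheet above.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output indent="yes" />

<xsl:template match="items">
  <!-- a new group begins at every firstitem element -->
  <xsl:for-each-group select="*" group-starting-with="firstitem">
    <list>
      <!-- current-group() holds the firstitem and the items after it -->
      <xsl:for-each select="current-group()">
        <item><xsl:value-of select="." /></item>
      </xsl:for-each>
    </list>
  </xsl:for-each-group>
</xsl:template>

</xsl:stylesheet>
```

Any item that appears before the first firstitem would form a group of its own here, whereas the XSLT 1.0 stylesheet above would drop it.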

5.

Flat structure to hierarchy

Paul Tchistopolski


--------------- flat.xml:

 <Node>
 A1
 </Node>
 <Node>
 A2
 </Node>
 <Node>
 A3
 </Node>

 so that I can get

  <A1><A2><A3></A3></A2></A1>

--------------- flat.xml, wrapped in a root element

<doc>

<Node>
        A1
</Node>
<Node>
        A2
</Node>
<Node>
        A3
</Node>


</doc>


--------- Stylesheet

<xsl:stylesheet
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 version="1.0">


<xsl:template match="/doc">

 <xsl:call-template name="level">
  <xsl:with-param name="node" select="/doc/Node[1]"/>
 </xsl:call-template>

</xsl:template>


<xsl:template name="level">
        <xsl:param name="node"/>

 <xsl:variable name="n" select="normalize-space($node)"/>
 <xsl:element name="{$n}">

  <xsl:if test="$node/following::Node">

   <xsl:call-template name="level">
   <xsl:with-param name="node" select="$node/following-sibling::Node[1]"/>
   </xsl:call-template>

  </xsl:if>

 </xsl:element>

</xsl:template>

</xsl:stylesheet>

---- result:

<?xml version="1.0" encoding="utf-8"?>
<A1><A2><A3/></A2></A1>


    

6.

How to get structure out of a flat document

Jeni Tennison

This stylesheet can handle improperly nested headings to a certain extent (it doesn't create higher-level sections if they're missing). One thing that it doesn't do that Wendell's solution does is include any nodes that occur before the first h1, which may or may not be an issue.

The breakthrough moment for me came when I remembered that you can have a number of xsl:key elements with the same name, allowing you to index different nodes on different things under a single key name, and hence to have a single template for all headings, cutting down on repetition.

Another thing that's possibly of note is that you don't have to say explicitly what elements count as 'data', only that they are not headers. There's no need to explicitly match 'stuff' as data within the 'stuffchildren' (or in my case 'immediate-nodes') key, just anything that isn't a header.

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="levels">
   <xsl:apply-templates select="h1" />
</xsl:template>

<xsl:key name="next-headings" match="h6"
         use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3 or self::h4 or
                                               self::h5][1])" />
<xsl:key name="next-headings" match="h5"
         use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3 or self::h4][1])" />
<xsl:key name="next-headings" match="h4"
         use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3][1])" />
<xsl:key name="next-headings" match="h3"
         use="generate-id(preceding-sibling::*[self::h1 or self::h2][1])" />
<xsl:key name="next-headings" match="h2"
         use="generate-id(preceding-sibling::h1[1])" />

<xsl:key name="immediate-nodes"
         match="node()[not(self::h1 | self::h2 | self::h3 | self::h4 |
                           self::h5 | self::h6)]"
         use="generate-id(preceding-sibling::*[self::h1 or self::h2 or
                                               self::h3 or self::h4 or
                                               self::h5 or self::h6][1])" />

<xsl:template match="h1 | h2 | h3 | h4 | h5 | h6">
   <xsl:variable name="level" select="substring-after(name(), 'h')" />
   <section level="{$level}">
      <head><xsl:apply-templates /></head>
      <xsl:apply-templates select="key('immediate-nodes', generate-id())" />
      <xsl:apply-templates select="key('next-headings', generate-id())" />
   </section>
</xsl:template>

<xsl:template match="stuff">
   <data><xsl:apply-templates /></data>
</xsl:template>

<xsl:template match="node()">
   <xsl:copy-of select="." />
</xsl:template>

</xsl:stylesheet>

7.

Nested sections from flat structure

Jeni Tennison


> My source documents are a bit troublesome. They have a structure
> like HTML. There are two elements that indicate start of a section,
> <head> and <a name="">. Both appear inside a p element. A source
> document can be:
> - without subsections
> - with subsections indicated by <head>
> - with subsections indicated by <a name="">
> - with subsections indicated by <a name="">, and <head> subsections
> inside the first subsections
> - with subsections indicated by <head>, and <a name=""> subsections
> inside the first subsections

When you have troublesome source then it's sometimes better to take a two-pass approach to the problem - transform it into something more approachable, and then transform *that* into the output that you want.

Your source would be a lot more approachable if it was clear what was a 'top' section and what were sections underneath them. In the following stylesheet, I first convert the source into:

<GeneralSection/>
<Paragraph>first some intro text</Paragraph>
<TopSection/>
<SectionTitle>A</SectionTitle>
<Paragraph>goes with A.</Paragraph>
<SubSection ID="A.1"/>
<Paragraph>Text with A.1</Paragraph>
<Paragraph>More text with A.1</Paragraph>
<SubSection ID="A.2"/>
<Paragraph>text with A.2</Paragraph>
<Paragraph>Goes with A.2</Paragraph>
<TopSection/>
<SectionTitle>B</SectionTitle>
<Paragraph>Goes with B.</Paragraph>
<Paragraph>Goes with  B.</Paragraph>
<TopSection/>
<SectionTitle>C</SectionTitle>
<Paragraph>Goes with C</Paragraph>
<Paragraph>Goes with C</Paragraph>

And from there I use the Muenchian method to create the hierarchy.

The advantage is that both transformations are relatively straightforward; the disadvantage is that you need to either use a node-set extension in your stylesheet or split the stylesheet in two and use the second on the output of the first.

If you want to use a one-pass approach, like your stylesheet is at the moment, the only suggestion I'd make for simplifying it is to split up all those xsl:for-eaches into separate templates, using modes to distinguish between the different types of processing you do within each.

Anyway, here's my approach (unless you're using a processor that supports exsl:node-set(), you'll need to change the namespace for the node-set() function):

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exsl="http://exslt.org/common"
                extension-element-prefixes="exsl">

<xsl:output method="xml" indent="yes" />

<xsl:strip-space elements="*" />

<xsl:variable name="top" select="topic/text/p[head or a/@name][1]" />

<xsl:template match="/">
   <xsl:variable name="sections">
      <GeneralSection />
      <xsl:apply-templates mode="convert" />
   </xsl:variable>
   <xsl:apply-templates select="exsl:node-set($sections)/GeneralSection"
                        mode="group" />
</xsl:template>

<xsl:template match="p" mode="convert">
   <Paragraph><xsl:apply-templates /></Paragraph>
</xsl:template>

<xsl:template match="p[head]" mode="convert">
   <xsl:choose>
      <xsl:when test="$top/head"><TopSection /></xsl:when>
      <xsl:otherwise><SubSection /></xsl:otherwise>
   </xsl:choose>
   <SectionTitle><xsl:apply-templates select="head" /></SectionTitle>
   <Paragraph>
      <xsl:apply-templates select="node()[not(self::head)]" />
   </Paragraph>
</xsl:template>

<xsl:template match="p[a/@name]" mode="convert">
   <xsl:choose>
      <xsl:when test="$top/a/@name">
         <TopSection ID="{a/@name}" />
      </xsl:when>
      <xsl:otherwise><SubSection ID="{a/@name}" /></xsl:otherwise>
   </xsl:choose>
</xsl:template>

<xsl:key name="children" match="Paragraph | SectionTitle"
         use="generate-id(preceding-sibling::*
                             [self::GeneralSection or self::TopSection or
                              self::SubSection][1])" />
<xsl:key name="children" match="TopSection"
         use="generate-id(preceding-sibling::GeneralSection[1])" />
<xsl:key name="children" match="SubSection"
         use="generate-id(preceding-sibling::TopSection[1])" />

<xsl:template match="GeneralSection | TopSection | SubSection"
              mode="group">
   <!-- copy the matched element so each section keeps its own name -->
   <xsl:copy>
      <xsl:copy-of select="@*" />
      <xsl:apply-templates select="key('children', generate-id())"
                           mode="group" />
   </xsl:copy>
</xsl:template>

<xsl:template match="Paragraph" mode="group">
   <xsl:copy-of select="." />
</xsl:template>

</xsl:stylesheet>

8.

Flat file to structured, attribute based

Wendell Piez and Jeni Tennison


Given xml

 <doc>
 <p style="Heading1">Top Level Heading</p>
 <p style="Body">Some text at this level</p>
 <p style="Heading2">Sub Heading 1</p>
 <p style="Body">Some text here</p>
 <p style="Heading2">Sub heading 2</p>
 <p style="Body">Some text here</p>
 <p style="Heading3">Sub Heading 3</p>
 <p style="Body">Body 3 text</p>
 <p style="Heading4">Sub Heading 4</p>
 <p style="Body">Body 4 text</p>
 <p style="Heading1">Another Heading</p>
 <p style="Body">Some more text here</p>
 </doc>

Use xsl

<?xml version="1.0"?>
 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" indent="yes"/>
  <!-- i.e., convert the attribute into an element; its value is
 related to p -->
 <xsl:key name="Body" match="p[@style='Body']"
   use="generate-id((preceding-sibling::p[@style='Heading1']|
        preceding-sibling::p[@style='Heading2']|
        preceding-sibling::p[@style='Heading3']|
        preceding-sibling::p[@style='Heading4']|
        preceding-sibling::p[@style='Heading5']|
        preceding-sibling::p[@style='Heading6'])[last()])" />
 
 <xsl:key name="h2" match="p[@style='Heading2']"
 use="generate-id(preceding-sibling::p[@style='Heading1'][1])"/>
 <xsl:key name="h3" match="p[@style='Heading3']"
 use="generate-id(preceding-sibling::p[@style='Heading2'][1])"/>
 <xsl:key name="h4" match="p[@style='Heading4']"
 use="generate-id(preceding-sibling::p[@style='Heading3'][1])"/>
 <xsl:key name="h5" match="p[@style='Heading5']"
 use="generate-id(preceding-sibling::p[@style='Heading4'][1])"/>
 <xsl:key name="h6" match="p[@style='Heading6']"
 use="generate-id(preceding-sibling::p[@style='Heading5'][1])"/>
 
 <!--   Wendell Piez algorithm -->
 <xsl:template match="doc">
 <main>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 <xsl:apply-templates select="p[@style='Heading1']" mode="h1"/>
 </main>
 </xsl:template>
 
 <xsl:template match="p" mode="h1">
 <section level="1"><head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 <xsl:apply-templates select="key('h2',generate-id())" mode="h2"/>
 </section>
 </xsl:template>
 
 <xsl:template match="p" mode="h2">
 <section level="2"><head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 <xsl:apply-templates select="key('h3',generate-id())" mode="h3"/>
 </section>
 </xsl:template>
 
 <xsl:template match="p" mode="h3">
 <section level="3"><head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 <xsl:apply-templates select="key('h4',generate-id())" mode="h4"/>
 </section>
 </xsl:template>
 
 <xsl:template match="p" mode="h4">
 <section level="4"><head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 <xsl:apply-templates select="key('h5',generate-id())" mode="h5"/>
 </section>
 </xsl:template>
 
 <xsl:template match="p" mode="h5">
 <section level="5"><head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 <xsl:apply-templates select="key('h6',generate-id())" mode="h6"/>
 </section>
 </xsl:template>
 
 <xsl:template match="p" mode="h6">
 <section level="6"><head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('Body',generate-id())" mode="Body"/>
 </section>
 </xsl:template>
 
 <xsl:template match="p" mode="Body">
 <text><xsl:apply-templates/></text>
 </xsl:template>
 </xsl:stylesheet>
 

provides output

 <?xml version="1.0" encoding="utf-8"?>
 <main>
    <section level="1">
       <head>Top Level Heading</head>
       <text>Some text at this level</text>
       <section level="2">
          <head>Sub Heading 1</head>
          <text>Some text here</text>
       </section>
       <section level="2">
          <head>Sub heading 2</head>
          <text>Some text here</text>
          <section level="3">
             <head>Sub Heading 3</head>
             <text>Body 3 text</text>
             <section level="4">
                <head>Sub Heading 4</head>
                <text>Body 4 text</text>
             </section>
          </section>
       </section>
    </section>
    <section level="1">
       <head>Another Heading</head>
       <text>Some more text here</text>
    </section>
</main>

Alternatively, Jeni Tennison offers:

<?xml version="1.0"?>
 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output method="xml" indent="yes"/>
  <!-- i.e., convert the attribute into an element; its value is
 related to p -->
 
 <xsl:key  name="next-headings" match="p[@style='Heading2']"
 use="generate-id(preceding-sibling::p[@style='Heading1'
 or @style='Heading2' or @style='Heading3' or @style='Heading4' or
 @style='Heading5' or @style='Heading6'][1])"/>
 
 <xsl:key  name="next-headings" match="p[@style='Heading3']"
 use="generate-id(preceding-sibling::p[@style='Heading1'
 or @style='Heading2' or @style='Heading3' or @style='Heading4' or
 @style='Heading5' or @style='Heading6'][1])"/>
 
 <xsl:key  name="next-headings" match="p[@style='Heading4']"
 use="generate-id(preceding-sibling::p[@style='Heading1'
 or @style='Heading2' or @style='Heading3' or @style='Heading4' or
 @style='Heading5' or @style='Heading6'][1])"/>
 
 <xsl:key  name="next-headings" match="p[@style='Heading5']"
 use="generate-id(preceding-sibling::p[@style='Heading1'
 or @style='Heading2' or @style='Heading3' or @style='Heading4' or
 @style='Heading5' or @style='Heading6'][1])"/>
 
 <xsl:key  name="next-headings" match="p[@style='Heading6']"
 use="generate-id(preceding-sibling::p[@style='Heading1'
 or @style='Heading2' or @style='Heading3' or @style='Heading4' or
 @style='Heading5' or @style='Heading6'][1])"/>
 
 
 <xsl:key  name="immediate-nodes" 
 match="node()[not(@style='Heading1') and
 not(@style='Heading2') and not(@style='Heading3')  and
 not(@style='Heading4') and not(@style='Heading5') and
 not(@style='Heading6')]"
 use="generate-id(preceding-sibling::p[@style='Heading1' or 
 @style='Heading2'
 or @style='Heading3' or @style='Heading4' or @style='Heading5' or
 @style='Heading6'][1])"/>
 
 <!-- Jeni's  Algorithm  -->
 <xsl:template match="doc">
 <main>
 <xsl:apply-templates select="p[@style='Heading1']" />
 </main>
 </xsl:template>
 
 <xsl:template match="p" >
 <xsl:variable name="level"
 select="substring-after(@*[name()='style'],'ing')"/>
 
 <section level="{$level}">
 <head><xsl:apply-templates/></head>
 <xsl:apply-templates select="key('immediate-nodes',generate-id())"
 mode="Body"/>
 <xsl:apply-templates select="key('next-headings',generate-id())" />
 </section>
 </xsl:template>
 
 <xsl:template match="node()">
 <xsl:copy-of select="."/>
 </xsl:template>
 
 <xsl:template match="p" mode="Body">
 <text><xsl:apply-templates/></text>
 </xsl:template>
 </xsl:stylesheet>
 

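With the same XML input, this provides the output below. Because every next-headings key looks back to the nearest preceding heading of any level, "Sub heading 2" ends up nested inside "Sub Heading 1" instead of beside it: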

<?xml version="1.0" encoding="utf-8"?>
<main>
   <section level="1">
      <head>Top Level Heading</head>


      <section level="2">
         <head>Sub Heading 1</head>


         <section level="2">
            <head>Sub heading 2</head>


            <section level="3">
               <head>Sub Heading 3</head>


               <section level="4">
                  <head>Sub Heading 4</head>


               </section>
            </section>
         </section>
      </section>
   </section>
   <section level="1">
      <head>Another Heading</head>


   </section>
</main>
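
If the aim is the same heading nesting as the first stylesheet in this entry produces, one option is to key each heading on the nearest preceding heading of a strictly higher level, as in entry 6. The following is a sketch under that assumption and is not Jeni's posted code; the starts-with() test on @style and the single heading template are simplifications introduced here.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Each heading is indexed by the nearest preceding heading of a higher level -->
  <xsl:key name="next-headings" match="p[@style='Heading2']"
           use="generate-id(preceding-sibling::p[@style='Heading1'][1])"/>
  <xsl:key name="next-headings" match="p[@style='Heading3']"
           use="generate-id(preceding-sibling::p[@style='Heading1' or
                            @style='Heading2'][1])"/>
  <xsl:key name="next-headings" match="p[@style='Heading4']"
           use="generate-id(preceding-sibling::p[@style='Heading1' or
                            @style='Heading2' or @style='Heading3'][1])"/>
  <xsl:key name="next-headings" match="p[@style='Heading5']"
           use="generate-id(preceding-sibling::p[@style='Heading1' or
                            @style='Heading2' or @style='Heading3' or
                            @style='Heading4'][1])"/>
  <xsl:key name="next-headings" match="p[@style='Heading6']"
           use="generate-id(preceding-sibling::p[@style='Heading1' or
                            @style='Heading2' or @style='Heading3' or
                            @style='Heading4' or @style='Heading5'][1])"/>

  <!-- Body paragraphs are indexed by the nearest preceding heading of any level -->
  <xsl:key name="immediate-nodes" match="p[@style='Body']"
           use="generate-id(preceding-sibling::p[starts-with(@style,
                            'Heading')][1])"/>

  <xsl:template match="doc">
    <main>
      <xsl:apply-templates select="p[@style='Heading1']"/>
    </main>
  </xsl:template>

  <!-- One template for all heading levels -->
  <xsl:template match="p[starts-with(@style, 'Heading')]">
    <section level="{substring-after(@style, 'Heading')}">
      <head><xsl:apply-templates/></head>
      <xsl:apply-templates select="key('immediate-nodes', generate-id())"/>
      <xsl:apply-templates select="key('next-headings', generate-id())"/>
    </section>
  </xsl:template>

  <xsl:template match="p[@style='Body']">
    <text><xsl:apply-templates/></text>
  </xsl:template>

</xsl:stylesheet>
```

With the sample input, both Heading2 paragraphs are keyed to "Top Level Heading", so they should come out as siblings, as in Wendell's output.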

9.

Styled paragraphs to structured XML

Nicholas Waltham







 -----------Source File ----------
 
 <?xml version="1.0" encoding="UTF-8"?>
 <doc>
  <p style="Normal">Normal</p>
  <p style="L1"> - 1.1</p>
  <p style="L1"> - 1.2</p>
  <p style="L1"> - 1.3</p>
  <p style="L1"> - 1.4</p>
  <p style="L2"> - 1.4.1</p>
  <p style="L2"> - 1.4.2</p>
  <p style="L3"> - 1.4.2.1</p>
  <p style="L1">- 1.5</p>
  <p style="L1">- 1.6</p>
  <p style="Normal">Normal</p>
  <p style="L1">b First Level 1</p>
  <p style="L1">b Second Level 1</p>
  <p style="L2">b First <em>Level</em> 2</p>
  <p style="L2">b Second Level 2</p>
  <p style="L3">b First Level 3</p>
  <p style="L2">c Second Level 2</p>
  <p style="L3">c First Level 3</p>
  <p style="L1">b Third <em>Level</em> 1</p>
  <p style="L1">b Forth Level 1</p>
 </doc>
 

-----------Desired Output ----------


 <html>
 <body bgcolor="#ffffff">
  Normal
  <ol>
 <li> - 1.1</li>
 <li> - 1.2</li>
 <li> - 1.3</li>
 <li> - 1.4</li>
 <ol>
 <li> - 1.4.1</li>
 <li> - 1.4.2</li>
 <ol>
 <li> - 1.4.2.1</li>
 </ol>
 </ol>
 <li>- 1.5</li>
 <li>- 1.6</li>
 </ol>
 Normal
 <ol>
 <li>b First Level 1</li>
 <li>b Second Level 1</li>
 <ol>
 <li>b First <em>Level</em> 2</li>
 <li>b Second Level 2</li>
 <ol>
 <li>b First Level 3</li>
 </ol>
 <li>c Second Level 2</li>
 <ol>
 <li>c First Level 3</li>
 </ol>
 </ol>
 <li>b Third <em>Level</em> 1</li>
 <li>b Forth Level 1</li>
 </ol>
 </body>
 </html>
 

-----------XSL to do the job ----------

 <?xml version="1.0" encoding="UTF-8"?>
 <xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output method="html" version="1.0" encoding="UTF-8" 
 indent="yes"/>
 
 <!--Items in a list which are not the first items /ie list 
 continuation
 items-->
 <xsl:key name='listitems1' match="p[@style='L1'and
 preceding-sibling::p[1][@style='L1'
    or @style='L2' or @style='L3']]"
    use="generate-id(preceding-sibling::p[@style='L1'][1])"/>
 
 <xsl:key name='listitems2' match="p[@style='L2'and 
 preceding-sibling::p[1][
    @style='L2' or @style='L3']]"
    use="generate-id(preceding-sibling::p[@style='L2' and
 preceding-sibling::p[1][@style='L1']][1])"/>
 
 <xsl:key name='listitems3' match="p[@style='L3'and
 preceding-sibling::p[1][@style='L3']]"
    use="generate-id(preceding-sibling::p[@style='L3' and
 preceding-sibling::p[1][@style='L2']][1])"/>
 
 <!--First Items in sublists-->
 <xsl:key name='sublistitems2' match="p[@style='L2' and
 preceding-sibling::p[1][@style='L1']]"
    use="generate-id(preceding-sibling::p[@style='L1'][1])"/>
 
 <xsl:key name='sublistitems3' match="p[@style='L3' and
 preceding-sibling::p[1][@style='L2']]"
    use="generate-id(preceding-sibling::p[@style='L2'][1])"/>
 
 <!--Process Level 1 Lists-->
 <xsl:template match="p[@style='L1']" mode="startlist">
 <xsl:apply-templates select="." mode="Items"/>
 <xsl:apply-templates select="key('sublistitems2', generate-id())"
 mode="openlist"/>
 <xsl:apply-templates select="key('listitems1', generate-id())"
 mode="startlist"/>
 </xsl:template>
 
 <!--Process Level 2 Lists-->
 <xsl:template match="p[@style='L2']" mode="startlist">
  <xsl:apply-templates select="." mode="Items"/>
  <xsl:apply-templates select="key('sublistitems3', generate-id())"
 mode="openlist"/>
  <xsl:apply-templates select="key('listitems2', generate-id())"
 mode="startlist"/>
 </xsl:template>
 
 <!--Process Level 3 Lists-->
 <xsl:template match="p[@style='L3']" mode="startlist">
  <xsl:apply-templates select="." mode="Items"/>
  <xsl:apply-templates select="key('listitems3', generate-id())"
 mode="startlist"/>
 </xsl:template>
 
 <!--Start a list if the contents is not empty-->
 <xsl:template match="p[@style='L1'or @style='L2' or @style='L3']"
 mode="openlist">
 <xsl:if test="string(.)!=''">
 <ol>
 <xsl:apply-templates select="." mode="startlist"/>
 </ol>
 </xsl:if>
 </xsl:template>
 
 <!--Output Individual List Items-->
 <xsl:template match="p[@style='L1'or @style='L2' or @style='L3']"
 mode="Items">
 <xsl:if test="string(.)!=''">
 <li>
 <xsl:copy-of select="node()"/>
 </li>
 </xsl:if>
 </xsl:template>
 
 <!--Pick out the first level 1 item in a list, dump other items -->
 <xsl:template match="p[@style='L1'or @style='L2' or @style='L3']" >
 <xsl:if test="@style='L1' and not 
 (preceding-sibling::p[1][@style='L1'or
 @style='L2' or @style='L3'])">
 <ol>
 <xsl:apply-templates select="." mode="startlist"/>
 </ol>
 </xsl:if>
 </xsl:template>
 
 <xsl:template match="/">
 <html>
 <head>
 </head>
 <body bgcolor="#ffffff">
 <xsl:apply-templates />
 </body>
 </html>
 </xsl:template>
 
 </xsl:stylesheet>

10.

Text before end marker or applying structure

Dimitre Novatchev and Mike Kay



> My problem is easily explained. This is the simple XML:

> <p>some text some text <pb/>more text more text</p>

> In my (HTML) output I need to process and reproduce the 'some text'
> (i.e. up to the <pb/>) without the 'more text'. (Of course I also need
> to output just the 'more text' in another instance.) I'm having some
> trouble with this. Also, the 'some text' contains tags which need to be
> processed by the stylesheet.

The following is an outline of a general way to process groups of nodes delimited by a specific element. This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:apply-templates select="/*/p/pb"/>
  </xsl:template>
  
  <xsl:template match="p/pb">
    <xsl:element name="group{position()}">
      <xsl:apply-templates 
      select="preceding-sibling::node()[not(self::pb)]
                       [generate-id(following-sibling::pb)
                       = 
                        generate-id(current())
                        ]"/>
    </xsl:element>
    <xsl:if test="position() = last()">
      <xsl:element name="group{position() + 1}">
        <xsl:apply-templates select="following-sibling::node()"/>
      </xsl:element>
    </xsl:if>
  </xsl:template>
</xsl:stylesheet>

when applied on this source.xml:

<html>
  <p>some text1 some text1 
  <pb/>more text2 more text2
  <pb/>more text3 more text3
  </p>
</html>

produces:

<group1>some text1 some text1 
  </group1>
<group2>more text2 more text2
  </group2>
<group3>more text3 more text3
  </group3>

As we see, all nodes between two consecutive markers will be processed as a group.

Michael Kay offers

These problems are slightly tricky, and the exact details of the solution depend on the exact nature of the problem.

A typical example of the problem is to convert

<p>
line1<br/>
line2<br/>
linen
</p>

to

<p>
<line>line1</line>
<line>line2</line>
<line>linen</line>
</p>

The best way to tackle this, provided that it doesn't exceed the recursion depth allowed by most processors (say 1000), is to use apply-templates recursing along the following-sibling axis. In simple cases the following works:

<xsl:template match="p">
<p>
  <xsl:apply-templates select="child::text()[1]" mode="along"/>
</p>
</xsl:template>

<xsl:template match="text()" mode="along">
<line><xsl:value-of select="."/></line>
<xsl:apply-templates select="following-sibling::text()[1]"
mode="along"/>
</xsl:template>

Unfortunately this isn't very robust. Some processors (wrongly) split text nodes when entity references are encountered, and all (rightly) split them when comments are encountered; also your text might have other elements such as <i> embedded in it.

So a more robust technique is to match the br elements, and each time you hit a br, copy all the preceding siblings whose next br element is this one. This can be done as follows:

<xsl:template match="br" mode="along">
<line>
<xsl:copy-of select="preceding-sibling::node()[not(self::br)]
    [generate-id(following-sibling::br[1])=generate-id(current())]"/>
</line>
<xsl:apply-templates select="following-sibling::br[1]" mode="along"/>
</xsl:template>

Then you have to mop up the last group of elements (the ones that have no following br element) which you can do in the parent template.
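
A minimal sketch of how the pieces might fit together in XSLT 1.0, assuming the input is a p element whose lines are separated by br elements as in the example above. The parent template starts the walk at the first br and then emits the trailing group itself. The [not(self::br)] filters and the line wrapper are choices made for this illustration, not necessarily the only way to do it.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Parent template: start the walk at the first br, then mop up the tail -->
  <xsl:template match="p">
    <p>
      <xsl:apply-templates select="br[1]" mode="along"/>
      <!-- Last group: child nodes that have no following br sibling -->
      <line>
        <xsl:copy-of
          select="node()[not(self::br)][not(following-sibling::br)]"/>
      </line>
    </p>
  </xsl:template>

  <!-- Each br closes the group of non-br siblings whose next br it is -->
  <xsl:template match="br" mode="along">
    <line>
      <xsl:copy-of select="preceding-sibling::node()[not(self::br)]
        [generate-id(following-sibling::br[1]) = generate-id(current())]"/>
    </line>
    <!-- Move on to the next br, staying in the same mode -->
    <xsl:apply-templates select="following-sibling::br[1]" mode="along"/>
  </xsl:template>

</xsl:stylesheet>
```

If the p ends with a br, the final line element comes out empty, so a test on the trailing nodes may be wanted in that case.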

This has O(n^2) performance and it can recurse deeply, so it isn't ideal. A more efficient solution is to treat it as a grouping problem. The sibling elements between two br elements constitute a group; the grouping key for this group is the generate-id() of the most recent br (or p) element. So you can use Muenchian grouping with the grouping key:

<xsl:key name="k" match="p/node()" 
use="concat(generate-id(..), generate-id(preceding-sibling::br[1]))"/>
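
One way such a key might be used, offered as a sketch rather than the definitive form. It walks the br elements of each p and pulls the matching group out of the key, plus one extra lookup for the nodes before the first br. The key name lines, the '|' separator between the two ids, and the exclusion of br from the match pattern are assumptions made for this illustration.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index every non-br child of a p by (parent id, id of nearest preceding br).
       generate-id() of an empty node-set is the empty string, so nodes
       before the first br are indexed under "<parent id>|". -->
  <xsl:key name="lines" match="p/node()[not(self::br)]"
    use="concat(generate-id(..), '|',
                generate-id(preceding-sibling::br[1]))"/>

  <xsl:template match="p">
    <p>
      <!-- Group before the first br -->
      <line>
        <xsl:copy-of select="key('lines', concat(generate-id(), '|'))"/>
      </line>
      <!-- One group per br: the nodes that follow it up to the next br -->
      <xsl:for-each select="br">
        <line>
          <xsl:copy-of
            select="key('lines', concat(generate-id(..), '|', generate-id()))"/>
        </line>
      </xsl:for-each>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Adjacent br elements, or a br at the very start or end of the p, yield empty line elements with this sketch.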

It's easier in XSLT 2.0. If you can use Saxon 7.4, try:

<xsl:for-each-group select="child::node()" group-starting-with="br">
<line><xsl:copy-of select="current-group()[not(self::br)]"/></line>
</xsl:for-each-group>

11.

Making flat files structured hierarchically

Charles Knell

I am searching for a way to structure flat files. Take this source document as an example:

<?xml version="1.0"?>

<root>
   <title>Titel 1</title>
  <para>Para 1</para>
  <para>Para 2</para>
  <para>Para 3</para>
  <title>Title 2</title>
  <para>Para 1</para>
  <para>Para 2</para>
  <title>Title 3</title>
  <para>Para 1</para>
  <para>Para 2</para>
</root>

After an XSLT 1.0 transformation, the output should look like:

<?xml version="1.0"?>
<root>

   <layer>
     <title>Titel 1</title>
     <para>Para 1</para>
     <para>Para 2</para>
     <para>Para 3</para>
  </layer>
  
  <layer>
    <title>Title 2</title>
    <para>Para 1</para>
    <para>Para 2</para>
  </layer>

  <layer>  
    <title>Title 3</title>
    <para>Para 1</para>
    <para>Para 2</para>
  </layer>
</root>

There should be plenty of examples in the archives of this list. I know that I have done several of them myself. Here is another solution to this frequently-asked question.

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:strip-space elements="*" />

  <xsl:template match="/">
    <xsl:apply-templates />
  </xsl:template>

  <xsl:template match="root">
    <!-- Recreate the root wrapper so the result has a single document element -->
    <root>
      <xsl:apply-templates select="title" />
    </root>
  </xsl:template>

  <xsl:template match="title">
    <xsl:variable name="place" select="count(preceding-sibling::title)" />
    <layer>
      <xsl:copy-of select="." />
      <xsl:apply-templates select="following-sibling::para">
        <xsl:with-param name="slot" select="$place" />
      </xsl:apply-templates>
    </layer>
  </xsl:template>

  <xsl:template match="para">
    <xsl:param name="slot" />
    <xsl:choose>
      <xsl:when test="count(preceding-sibling::title) = $slot + 1">
        <xsl:copy-of select="." />
      </xsl:when>
      <xsl:otherwise />
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
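
The same grouping can also be sketched with a key, along the lines of approach (b) described earlier in this FAQ. Each para is indexed by the id of its nearest preceding title, so each title can fetch its own paragraphs directly instead of filtering all following siblings. The key name paras-by-title is an assumption made for this illustration, and the sketch assumes every para comes after some title (a para before the first title would be dropped).

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:strip-space elements="*" />

  <!-- Index each para by the generated id of the title it belongs to -->
  <xsl:key name="paras-by-title" match="para"
    use="generate-id(preceding-sibling::title[1])" />

  <xsl:template match="root">
    <root>
      <xsl:for-each select="title">
        <layer>
          <!-- The union is returned in document order:
               the title first, then its paragraphs -->
          <xsl:copy-of
            select=". | key('paras-by-title', generate-id())" />
        </layer>
      </xsl:for-each>
    </root>
  </xsl:template>

</xsl:stylesheet>
```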

12.

Grouping flat element structures to make lists

Jay Bryant, DC and MK

The trick is picking just the bits you want in the list. I took your sample lists and made a sample XML file and a stylesheet to demonstrate one way to do it (with apply-templates).

Here's the XML file (with non-list elements intermingled with the lists):

<?xml version="1.0" encoding="UTF-8"?>
<lists>
  <par class="Listbegin">Fruit</par>
  <par class="Listitem">Apple</par>
  <par class="Listitem">Orange</par>
  <par class="Listitem">Pear</par>
  <something>thing1</something>
  <something>thing2</something>
  <par class="Listbegin">Colours</par>
  <par class="Listitem">Red</par>
  <par class="Listitem">Green</par>
  <par class="Listitem">Blue</par>
</lists>

And here's the XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="lists">
    <lists>
      <xsl:for-each-group select="*" 
group-starting-with="par[@class='Listbegin']">
        <list>
          <xsl:apply-templates 
select="current-group()/self::par[@class='Listbegin']|current-group()/self::par[@class='Listitem']"/>
        </list>
      </xsl:for-each-group>
    </lists>
  </xsl:template>

  <xsl:template match="par[@class='Listbegin']">
    <head><xsl:value-of select="."/></head>
  </xsl:template>

  <xsl:template match="par[@class='Listitem']">
    <item><xsl:value-of select="."/></item>
  </xsl:template>

</xsl:stylesheet>

And here's the resulting output:

<?xml version="1.0" encoding="UTF-8"?>
<lists>
  <list>
    <head>Fruit</head>
    <item>Apple</item>
    <item>Orange</item>
    <item>Pear</item>
  </list>
  <list>
    <head>Colours</head>
    <item>Red</item>
    <item>Green</item>
    <item>Blue</item>
  </list>
</lists>

David Carlisle offers

<sect>

<foo/>
<xyz/>
<xyz/>
<par class="Listbegin">Fruit</par>
<par class="Listitem">Apple</par>
<par class="Listitem">Orange</par>
<par class="Listitem">Pear</par>
<par class="Listbegin">Colours</par>
<par class="Listitem">Red</par>
<par class="Listitem">Green</par>
<par class="Listitem">Blue</par>
<xyz/>
<bar/>
<par class="Listbegin">Fruit</par>
<par class="Listitem">Apple</par>
<par class="Listitem">Orange</par>
<par class="Listitem">Pear</par>
</sect>

<xsl:stylesheet
   version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>


<xsl:output indent="yes"/>

<xsl:template match="*">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates/>
</xsl:copy>
</xsl:template>

<xsl:template match="sect">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:for-each-group group-adjacent="@class='Listitem'" select="*">
<xsl:choose>
<xsl:when test="self::par[@class='Listitem']">
<list>
<head><xsl:apply-templates
select="preceding-sibling::par[1]/node()"/></head>
<xsl:apply-templates select="current-group()"/>
</list>
</xsl:when>
<xsl:otherwise>
<xsl:apply-templates select="current-group()"/>
</xsl:otherwise>
</xsl:choose>
</xsl:for-each-group>
</xsl:copy>
</xsl:template>

<xsl:template match="par[@class='Listbegin']"/>


</xsl:stylesheet>

and the output


$ saxon8 list.xml list.xsl
<?xml version="1.0" encoding="UTF-8"?>
<sect>
   <foo/>
   <xyz/>
   <xyz/>
   <list>
      <head>Fruit</head>
      <par class="Listitem">Apple</par>
      <par class="Listitem">Orange</par>
      <par class="Listitem">Pear</par>
   </list>
   <list>
      <head>Colours</head>
      <par class="Listitem">Red</par>
      <par class="Listitem">Green</par>
      <par class="Listitem">Blue</par>
   </list>
   <xyz/>
   <bar/>
   <list>
      <head>Fruit</head>
      <par class="Listitem">Apple</par>
      <par class="Listitem">Orange</par>
      <par class="Listitem">Pear</par>
   </list>
</sect>

In XSLT 2.0 the context item can be either a node or an atomic value. When you do this:

<xsl:variable name="split" 
              select="tokenize(replace($AName,'(.)([A-Z])','$1 $2'), ' ')"/>

        <xsl:for-each select="$split">

tokenize() returns a sequence of strings, so within the for-each the context item is a string. This means that when you use a relative path expression TopContext/*... the system doesn't know what it's relative to. Perhaps you intended it to be relative to the node that was the context item outside the for-each, in which case the answer is to bind that to a variable, and use the variable at the start of the path expression:

<xsl:variable name="c" select="."/>
<xsl:variable name="split" 
              select="tokenize(replace($AName,'(.)([A-Z])','$1 $2'), ' ')"/>

        <xsl:for-each select="$split">
           <xsl:value-of select="$c/TopContext/*...

But looking at your code, I think you may actually have intended it to be relative to the root of that document, in which case you would want

<xsl:variable name="c" select="/"/>

Whichever it is, you need to be explicit.
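
As a self-contained illustration of the first option, here is a small sketch. The element and attribute names (item, @name, @id, word) are hypothetical and not taken from the original question; only the pattern of binding the outer context node to a variable before iterating over strings is the point.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- "item", "@name", "@id" and "word" are hypothetical names -->
  <xsl:template match="item">
    <!-- Remember the node that is the context item outside the loop -->
    <xsl:variable name="c" select="."/>
    <xsl:for-each select="tokenize(@name, ' ')">
      <!-- Inside the loop "." is a string, so any path that needs
           the item element must start from $c -->
      <word owner="{$c/@id}">
        <xsl:value-of select="."/>
      </word>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Writing @id instead of $c/@id inside the loop would be an error, because the context item there is an atomic value rather than a node.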