Indexing XML

Indexing in XSLT

1. Index into XML files
2. Building a grouped index in XSL

1.

Index into XML files

DaveP

Background.

I kept having requests from users for a better index for the XSLT faq. Perhaps because it's grown up over time, or because I'm no librarian, others' view of where I should file something rarely agreed with mine. An index seemed an obvious answer.

I guessed I could use XSLT to do it, and after thinking for a while, decided I wanted a general purpose system that I could expand as needed. The end goal was one or more index files of terms, the data source to be indexed, and an output html file pointing into the html files derived from the data source. That was the requirement.

Issues

  • The data source I was indexing was in XML, but the links needed to point into HTML created by someone else's stylesheets (Norm Walsh's docbook stylesheets), which meant I needed to modify them as little as possible.

  • The data sources were in about 200 separate files (last time I looked), and were in no way static, being updated weekly. So the index build had to be automated.

  • I knew it would bust XSLT 1.0, because I would have to use node-set to convert result tree fragments to node-sets, but I guess that will be fixed as soon as 1.1 is released.

  • I was familiar with neither the document function, to search the input, nor the key function, both of which I knew I would need to avoid duplicates in the output.

  • If I wanted indexing to the nearest parent element of the node containing the sought term, I would have to modify the docbook stylesheets.

  • Apart from that it was just difficult.

With that in front of me I made a start! Each source (term) file was a simple word list, such as the example below.

<?xml version="1.0"?>
<words>
<word>boolean(</word>
<word>ceil(</word>
<word>ceiling(</word>
<word>concat(</word>
<word>contains(</word>
<word>count(</word>
<word>current(</word>
<word>document(</word>
<word>element-available(</word>
<word>false(</word>
<word>floor(</word>
<word>format-number(</word>
<word>function-available(</word>
<word>generate-id(</word>
etc. 

The terminating left paren is there to separate the word current from the function current(). I didn't want it to show in the final output, but substring-before() provided a quick answer to that one.

I knew it would be a two stage affair. I had to walk through each data file comparing certain contents with each word in the word list. This meant a double for-each, to change context each time.

The intermediate file format is again straightforward. I could have done it... correction, it could have been done, in one stylesheet, but I felt that beyond me, and there would be no way I could maintain it. It looks like this:

<?xml version="1.0" encoding="utf-8"?>
<faqindex>
	<popular>
		<pair>
			<word>count</word>
			<file>list.html#para-13</file>
		</pair>
		<pair>
			<word>reference</word>
			<file>list.html#para-9</file>
		</pair>
	</popular>
	<functions/>
	<attributes/>

	<functions/>
	<attributes/>

	<functions>
		<pair>
			<word>document</word>
			<file>N169.html#para-9</file>
		</pair>
	</functions>
</faqindex>

Within the outer wrapper there are three kinds of element, one for each of the source term files, and all take the same format: a wrapper for word and file pairs. In each pair, the first element is the term I wanted indexing; the second is the file plus the fragment identifier of the nearest parent element of the node containing the term. The location within the html file is identified by a value built with xsl:number, applied to the element, which only has to be unique within the file. Jeni Tennison provided this routine.

So, that's the index term files and the intermediate file. All that's needed now is a couple of stylesheets: one to search the data source and index term files, and a second to produce the actual html index. This was made simpler for me by having the output files all in the same directory. I could have produced another docbook compliant file, and may sometime, but for the moment I'll keep it simple.

Firstly then, the stylesheet to produce the intermediate file. I'll break it up to explain each section of code. It is run with the data source files as the input document, creating a temporary output file used by the second stylesheet.

  <!DOCTYPE xsl:stylesheet [
  <!ENTITY sp "<xsl:text> </xsl:text>">
  <!ENTITY dot "<xsl:text>.</xsl:text>">
  <!ENTITY nl "&#10;">
<!ENTITY  rsqb	"&#x005D;"><!--	# RIGHT SQUARE BRACKET -->
<!ENTITY  lt	"&#x003C;"><!--	# LESS-THAN SIGN -->
<!ENTITY  gt	"&#x003E;"><!--	# GREATER-THAN SIGN -->


]>


 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version="1.0"

                xmlns:xt="http://www.jclark.com/xt"
                extension-element-prefixes="xt">

   <xsl:strip-space elements="*"/>
<xsl:output  
  method="xml" 
  indent="yes"   />
    

Standard stylesheet header, needing James Clark's xt:node-set extension to produce node-sets from result tree fragments. The extension functions use the xt: prefix.


  <xsl:template match="/">
    <faqindex>
    <xsl:apply-templates/>
  </faqindex>
  </xsl:template>

  <xsl:template match="website">
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="homepage">
   <xsl:apply-templates/>
  </xsl:template>
    

The above templates simply take me down to the element of interest within the data source files which are being indexed. Change these to match the wrapper above the ones you are interested in!

<!-- Process each webpage child in directory xsl -->
<xsl:template match="webpage[config/@value='xsl']">
  <xsl:variable name="file" select="config[@param='filename']/@value"/>
  <xsl:variable name="answers"
                select="qandaset/qandaentry/answer/screen |
                        qandaset/qandaentry/answer/para" />

    

The predicate on the template match is there simply because I have xml files which I don't want to index. All the ones of interest have a config element whose value attribute contains xsl.

Two variables are created:

The answers variable holds all the elements of interest which may contain one of the terms I'm searching for.

The file variable holds the filename of the html file which will be produced from the xml data file I'm searching. Does that make sense? It happens to be held in the value attribute of the config element whose param attribute is 'filename', i.e. config[@param='filename']/@value in the source file. Thanks Norm!
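For orientation, a data source file of roughly the following shape would satisfy the paths used above. This is a hypothetical reconstruction from the XPath expressions only: the param name dir, the question text and the answer content are assumptions, not taken from the real faq sources.

```xml
<?xml version="1.0"?>
<!-- Hypothetical input, shaped to match the templates and variables above -->
<website>
  <homepage>
    <webpage>
      <!-- config/@value='xsl' marks this page as one to index
           (the param name "dir" is an assumption) -->
      <config param="dir" value="xsl"/>
      <!-- config[@param='filename']/@value feeds the $file variable -->
      <config param="filename" value="N169.html"/>
      <qandaset>
        <qandaentry>
          <question>
            <para>How do I read a second file?</para>
          </question>
          <answer>
            <!-- answer/para and answer/screen are what $answers selects -->
            <para>Use the document( function.</para>
            <screen>&lt;xsl:for-each select="document('words.xml')/words/word"&gt;</screen>
          </answer>
        </qandaentry>
      </qandaset>
    </webpage>
  </homepage>
</website>
```

Assuming words.xml lists document( and the other two term files exist, the para and the screen in the answer should each produce a pair, pointing at N169.html#para-1 and N169.html#screen-1 respectively. The para inside question is not selected, because $answers only looks under answer.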

 
  <xsl:choose>

    <xsl:when test="qandaset">
     <!-- Check, for ea word popular file-->
      <popular>
        <xsl:for-each select="document('popular.xml')/words/word">
          <xsl:variable name="word" select="."/>
          <xsl:for-each select="$answers[contains(., $word)]">
            <pair>
              <word>
                <xsl:value-of select="$word"/>
              </word>
              <file>
                 <xsl:value-of select="$file"/>
                <xsl:text>#</xsl:text>
                <xsl:apply-templates select="." mode="get-id" />
              </file>
            </pair>
          </xsl:for-each>
        </xsl:for-each>
      </popular>

    

The document function is used to pick up the words I want indexing. Don't, as I did initially, forget the / before the words element (i.e. trying to use words/word). Remember that document() returns the funny root node, and the words element is a child of that!
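The point about the root node can be checked in isolation. The following is a minimal sketch, assuming a popular.xml in the word-list format shown earlier sits next to the stylesheet; the variable name terms is hypothetical. It can be run against any input document, since it only reads the term file.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="text"/>

  <!-- document() returns the root node of popular.xml, so the path
       continues with /words (the document element) and then /word. -->
  <xsl:variable name="terms" select="document('popular.xml')/words/word"/>

  <xsl:template match="/">
    <xsl:text>with /words/word: </xsl:text>
    <xsl:value-of select="count($terms)"/>
    <xsl:text>&#10;</xsl:text>
    <!-- This path selects nothing: word is not a child of the root node. -->
    <xsl:text>with /word only: </xsl:text>
    <xsl:value-of select="count(document('popular.xml')/word)"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Holding the word list in a top-level variable like this is a design choice of the sketch, not something the stylesheet in this faq does; it simply keeps the document() call in one place.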

The xsl:choose is used to isolate the qandaset children from any others. This is merely a precaution for future expansion. The otherwise clause simply processes children, since any webpage may contain other webpages which may be of interest for indexing. Good is docbook!

The actual when clause does the work.

The logic reads something like: for-each word in the term file, search the data source elements for any which contain that word. I could have reduced the size of the output file by using an xsl:if, but it's not particularly large anyway, so I didn't! The substring-before function (in the for-each below) is used to output only the word, not the left parenthesis, into the output file.

This process is repeated for each of the other term files (words.xml and attributes.xml). Each has a different terminator used for my own purpose. The simplest one is the popular.xml example, which outputs only the word itself.



      <!-- Check, for ea word FUNCTIONS file-->
      <functions>
        <xsl:for-each select="document('words.xml')/words/word">
          <xsl:variable name="word" select="."/>
          <xsl:for-each select="$answers[contains(., $word)]">
            <pair>
              <word>
                <xsl:value-of select="substring-before($word,'(')"/>
              </word>
              <file>
                 <xsl:value-of select="$file"/>
                <xsl:text>#</xsl:text>
                <xsl:apply-templates select="." mode="get-id" />
              </file>
            </pair>
          </xsl:for-each>
        </xsl:for-each>
      </functions>


      <!-- Check, for ea word attributes file.-->
      <attributes>
        <xsl:for-each select="document('attributes.xml')/words/word">
          <xsl:variable name="word" select="."/>
          <xsl:for-each select="$answers[contains(., $word)]">
            <pair>
              <word>
                <xsl:value-of select="substring-before($word,'=')"/>
              </word>
              <file>
                <xsl:value-of select="$file"/>
                <xsl:text>#</xsl:text>
                <xsl:apply-templates select="." mode="get-id" />
              </file>
            </pair>
          </xsl:for-each>
        </xsl:for-each>
      </attributes>
    </xsl:when>


    <xsl:otherwise>
      <xsl:if test="webpage">
        <xsl:apply-templates />
      </xsl:if>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

    

This little template (the get-id one, shown further down) is a gem. Guess who designed it? Thanks Jeni Tennison (again)! It's used twice: once here, to provide (whether needed or not) a target fragment identifier for the href, and once within docbook, for the same elements, to provide the id value!

The actual value generated is the element name (para or screen), concatenated with a hyphen and the value of xsl:number, which is sufficiently unique within the file! Very clever.

Just in case Norm wants to pick it up, the first docbook file modified is verbatim.xsl, which I changed to read:

<xsl:template match="programlisting|screen|literallayout[@class='monospaced']">
  <pre class="{name(.)}">
    <xsl:attribute name="id">
      <xsl:apply-templates select="." mode="get-id" />
    </xsl:attribute><xsl:apply-templates/></pre>
</xsl:template>

And block.xsl, modified as shown below.

<!-- Modified, DP to add the id value by call to new template -->
<xsl:template match="para">
  <p>
    <xsl:attribute name="id">
      <xsl:apply-templates select="." mode="get-id" />
    </xsl:attribute>
    <xsl:apply-templates/>
  </p>
</xsl:template>

The same template is added to Norm's docbook ones, to provide the function which similarly calculates the id value for the para and screen. If you want to index into various elements, then a similar call can be added for any list of elements. Easy?


<xsl:template match="para | screen" mode="get-id">
  <xsl:value-of select="local-name()" />
  <xsl:text>-</xsl:text>
  <xsl:number level="multiple" />
</xsl:template> 

  <xsl:template match="*"/>
</xsl:stylesheet>

And that's it. The output file is exemplified above, and is used as an input to the second stylesheet.
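To see which ids the get-id template hands out for a given data file, a small stand-alone stylesheet can print them. The sketch below is an illustration rather than part of the faq build: it adds programlisting to the match pattern as a hypothetical example of indexing into a further element, and it writes plain text so the list is easy to scan.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="text"/>

  <!-- Print one generated id per line for every element of interest. -->
  <xsl:template match="/">
    <xsl:for-each select="//para | //screen | //programlisting">
      <xsl:apply-templates select="." mode="get-id"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <!-- Same body as the get-id template above; programlisting is a
       hypothetical addition to the match pattern. -->
  <xsl:template match="para | screen | programlisting" mode="get-id">
    <xsl:value-of select="local-name()"/>
    <xsl:text>-</xsl:text>
    <xsl:number level="multiple"/>
  </xsl:template>
</xsl:stylesheet>
```

With level="multiple" and no count attribute, xsl:number numbers the element among its same-named siblings (and does the same for any same-named ancestors), so two para elements in different answers can receive the same number. If the printed list shows repeats that matter, level="any" is one alternative to try, provided the same change is made in the docbook copy of the template so that ids and links still agree.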

The second stage of the index production has the simple input file and the desired output is a list of links, highlighting the words and linking to the file which contains those words. It could perhaps be improved by providing more information (perhaps the title of the file?), but in order to present well, I've kept it simple.

The following stylesheet does that, chopping it up into three tables to minimise file size. Very straightforward and just about self documenting. Note the double usage of xt:node-set to sort and re-use.

   <!DOCTYPE xsl:stylesheet [
  <!ENTITY sp "<xsl:text> </xsl:text>">
  <!ENTITY dot "<xsl:text>.</xsl:text>">
  <!ENTITY nl "&#10;">
<!ENTITY  rsqb	"&#x005D;"><!--	# RIGHT SQUARE BRACKET -->
<!ENTITY  lt	"&#x003C;"><!--	# LESS-THAN SIGN -->
<!ENTITY  gt	"&#x003E;"><!--	# GREATER-THAN SIGN -->


]>


 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                 version="1.0"

                xmlns:xt="http://www.jclark.com/xt"
                extension-element-prefixes="xt">

   <xsl:strip-space elements="*"/>
<xsl:output  
  method="html" 
  indent="yes"   />


  <xsl:template match="/">
    <html>
      <head><title>Index</title>
      </head>
      <body>
	<xsl:apply-templates/>
      </body>
    </html>
  </xsl:template>

<xsl:template match="faqindex"> 
     <H2>Index</H2> 
     <p>Rev 1.2 2 Sept 2000:</p>




     <!-- Sort the functions --> 
       <xsl:variable name="fns"> 
        <xsl:for-each select="functions[*]"> 
          <xsl:sort data-type="text" select="pair/word"/> 
          <xsl:copy-of select="." /> 
        </xsl:for-each> 
       </xsl:variable> 

<!-- now convert that to a node-set and step through it --> 
  <H3><a name="fns">XSLT functions</a></H3> 
 
    

I chose four column output, which matched my usage. I could have used Mike Brown's idea to terminate the last row, but I seldom get around to such niceties! Also of note, I failed to use the Muenchian technique to isolate duplicate files! The three tables are identical in usage, hence I only show one.



<!-- 4 column output is about right for this word list. -->
  <xsl:variable name="cols" select="4"/>


<!-- Try and tabulate the functions. -->
  <xsl:variable name="sortedFns">
    <xsl:for-each select="xt:node-set($fns)/functions[*]"> 
      <xsl:variable name="fn" select="pair/file"/> 
      <xsl:if test="not(preceding-sibling::attributes/pair/file= $fn)"> 
	<a href="{pair/file}">* <xsl:value-of select="pair/word"/></a><br /> 
      </xsl:if> 
    </xsl:for-each> 
  </xsl:variable>


  <table border="1">
    <xsl:for-each select="xt:node-set($sortedFns)/a[position() mod $cols = 1]">
      <tr>
        <xsl:apply-templates
          select=". | following-sibling::a[position() &lt; $cols]" mode="Frow"/>
      </tr>
    </xsl:for-each>
  </table>




 
   </xsl:template> <!-- End of main template! -->




<!-- Now the general templates, used by all 3 sections -->



   <xsl:template match="a" mode="row">
     <td>
       <xsl:copy>
	 <xsl:copy-of select="@*"/>
	 <xsl:value-of select="."/>
       </xsl:copy></td>
   </xsl:template>
 

   <xsl:template match="a" mode="Frow">
     <td>
       <xsl:copy>
	 <xsl:copy-of select="@*"/>
	 <xsl:value-of select="."/>()
       </xsl:copy></td>
   </xsl:template>
 





  <xsl:template match="*">
    <p style= "color:red">*********<xsl:value-of select="name(..)"/>/<xsl:value-of select="name()"/>*********</p>
  </xsl:template>
 
</xsl:stylesheet>
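The duplicate removal mentioned above can be sketched with the Muenchian technique. This is a minimal illustration, not the stylesheet used for the faq: the key name fn-pair and the '|' separator are assumptions, it handles only the functions entries of the intermediate file, and it writes a plain list rather than the four-column table. Because it works on the source tree directly, it needs no node-set extension.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="html" indent="yes"/>

  <!-- Index every function pair by word plus target. The '|' separator
       is assumed not to occur in either value. -->
  <xsl:key name="fn-pair" match="functions/pair"
           use="concat(word, '|', file)"/>

  <xsl:template match="/faqindex">
    <ul>
      <!-- Keep a pair only if it is the first one in its key group;
           later pairs with the same word and file are skipped. -->
      <xsl:for-each select="functions/pair[generate-id() =
          generate-id(key('fn-pair', concat(word, '|', file))[1])]">
        <xsl:sort select="word"/>
        <li>
          <a href="{file}"><xsl:value-of select="word"/>()</a>
        </li>
      </xsl:for-each>
    </ul>
  </xsl:template>
</xsl:stylesheet>
```

Keying on the file alone instead of the word and file together would give one link per target, which may be closer to "isolating duplicate files"; which of the two is wanted depends on how the index should read.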

Rather a long faq, but since I struggled with the principles for so long, I thought I could save others with a similar need some of the trouble.

Hope you like it! DaveP

2.

Building a grouped index in XSL

Michael Kay


> I need to build an index in an XSL stylesheet that has a
> variable number of
> entry groupings ( numbers, a-e, f-i, etc..).  

	

Use a Muenchian grouping, as described on www.jenitennison.com, with a grouping key of

translate(substring(., 1, 1), "abcdefg...ABCDEFG...", 
	  "aaaaafffffiiiii...")
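
A fuller sketch of that grouping follows. The bucket boundaries (digits, a-e, f-i, then everything from j onwards) and the words/word input format are assumptions made for illustration, and the translate() strings are spelled out only for those hypothetical buckets. The key expression is kept in an entity named group (also an assumption) so that the xsl:key declaration and the lookups use identical text.

```xml
<?xml version="1.0"?>
<!DOCTYPE xsl:stylesheet [
  <!-- Hypothetical buckets: digits -> 0, a-e -> a, f-i -> f, j-z -> j.
       The inner translate() lowercases the first character; the outer
       one maps it to the representative character of its bucket. -->
  <!ENTITY group "translate(
      translate(substring(., 1, 1),
                'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                'abcdefghijklmnopqrstuvwxyz'),
      '0123456789abcdefghijklmnopqrstuvwxyz',
      '0000000000aaaaaffffjjjjjjjjjjjjjjjjj')">
]>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="1.0">
  <xsl:output method="html" indent="yes"/>

  <!-- Every word is indexed under its bucket character. -->
  <xsl:key name="by-group" match="word" use="&group;"/>

  <xsl:template match="/words">
    <!-- Muenchian step: visit only the first word of each bucket. -->
    <xsl:for-each select="word[generate-id() =
        generate-id(key('by-group', &group;)[1])]">
      <xsl:sort select="&group;"/>
      <h3><xsl:value-of select="&group;"/></h3>
      <ul>
        <!-- Then list every word that shares that bucket. -->
        <xsl:for-each select="key('by-group', &group;)">
          <xsl:sort select="."/>
          <li><xsl:value-of select="."/></li>
        </xsl:for-each>
      </ul>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Because buckets only appear when at least one word falls into them, the number of groupings varies with the input. Words whose first character is outside the translated ranges end up in a bucket of their own, keyed by that character.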