XML, duplicate elements, remove duplicates

Duplicates

1. Remove duplicate nodes
2. Remove duplicates from a list
3. Testing for duplicate id values on two elements.
4. Removing duplicates
5. Sort by number of occurrences and remove duplicates.
6. List all unique element and attribute names

1.

Remove duplicate nodes

David Carlisle

select="reference[not(@name=following::reference/@name)]"
returns the reference elements such that no two have the same name.
Where several share a name, the last one in document order is kept.
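
As a minimal sketch of where that select might live, assume a hypothetical input whose root element references holds reference children. The root element name and the output shape are illustrative assumptions, not part of the original answer.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed input: <references><reference name="..."/>...</references> -->
  <xsl:template match="/references">
    <references>
      <!-- Keep a reference only if no later reference has the same @name,
           so the last member of each set of duplicates survives. -->
      <xsl:copy-of
          select="reference[not(@name = following::reference/@name)]"/>
    </references>
  </xsl:template>

</xsl:stylesheet>
```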

            

2.

Remove duplicates from a list

Clark C. Evans


Q: Expansion.
  A part of my xml looks like this:
  
 <location>
   <state>xxxx</state>
 </location>
  
 <location>
    <state>yyyy</state>
 </location>
 
  <location>
    <state>xxxx</state>
 </location>
 
 The desired output is:
 
   xxxx
   yyyy
   
 That is, duplicate values of state should not be printed.
 Can this be done? 

 
   <xsl:variable name="unique-list"
     select="//state[not(.=following::state)]" />

   <xsl:for-each select="$unique-list">
     <xsl:value-of select="." />
   </xsl:for-each>
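
A fuller sketch, assuming the location elements sit inside some wrapper element (the question does not show one). It swaps following:: for preceding:: so that the first occurrence of each value is kept, which matches the order of the desired output, and it writes a line break after each value. The text output method is also an assumption.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- A state is kept only if no earlier state has the same value. -->
    <xsl:for-each select="//state[not(. = preceding::state)]">
      <xsl:value-of select="."/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

With the three location elements from the question this should print xxxx and then yyyy, each on its own line.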

            

3.

Testing for duplicate id values on two elements.

Jeni Tennison

If there is an element that has the same @id attribute as the current element, then:

  @id = (preceding::*/@id | following::*/@id)

would be true. You can get elements that have repeated ids with:

  //*[@id = (preceding::*/@id | following::*/@id)]

Of course if you test the above expression as a boolean, then it will return true if there are any repeated ids in the document.

You can list the IDs that are repeated (once per ID) with:

  //*[not(preceding::*/@id = @id) and
      following::*/@id = @id]

The preceding:: and following:: axes are, of course, horribly inefficient. You could use a recursive solution instead, or you could use keys:

<xsl:key name="ids" match="*" use="@id" />

If, for an element, the 'ids' key returns more than one node, then it's a repeated id, so you can test whether there are any repeats with:

  //*[key('ids', @id)[2]]

[Aside: I've started using positional predicates to test whether a node set has more than a certain number of nodes, although perhaps any processor that optimises '$nodes[2]' will also optimise 'count($nodes) > 1'. Any thoughts?]

If you want to get a list (without repeats) of the repeated IDs using the key() method, then you can use the usual:

  //*[key('ids', @id)[2] and
      count(.|key('ids', @id)[1]) = 1]

[or actually, by extension to the above aside:

  //*[key('ids', @id)[2] and
      not((.|key('ids', @id)[1])[2])]

making identity-testing of nodes even more obscure! :)]
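
Putting the key-based tests together, here is a sketch of a stylesheet that reports each repeated id once. The text output and the message wording are illustrative assumptions; the key is declared with the match attribute of xsl:key.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Index every element by its id attribute. -->
  <xsl:key name="ids" match="*" use="@id"/>

  <xsl:template match="/">
    <!-- Select the first element of each group of two or more
         elements that share an id. -->
    <xsl:for-each select="//*[key('ids', @id)[2] and
                              count(. | key('ids', @id)[1]) = 1]">
      <xsl:text>Repeated id: </xsl:text>
      <xsl:value-of select="@id"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Elements without an id attribute drop out on their own, because key('ids', @id) returns an empty node set for them.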

Testing this across documents is harder because (a) nodes from different documents aren't related to each other through axes and (b) key() is scoped within the document of the context node. Probably the simplest solution is to create a node set variable holding copies of the nodes from the relevant documents, and then test that as if it were a document on its own. (In XSLT 1.0 such a variable is a result tree fragment, so it first has to be converted with a node-set() extension function before a path can be applied to it.)

  <xsl:variable name="IDed-elements">
    <xsl:copy-of select="(/ | document('foo.xml'))//*[@id]" />
  </xsl:variable>
  <xsl:if test="$IDed-elements//*[key('ids', @id)[2]]">
    <!-- repeated IDs -->
  </xsl:if>

Alternatively, you could use a recursive method.
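
One possible XSLT 1.0 rendering of the variable-based approach is sketched below. It rests on two assumptions that are not in the original: that the processor supports the EXSLT exsl:node-set() extension function, and that shallow copies (the element with only its id attribute) are acceptable. Shallow copies are used because a deep copy of an element also copies any IDed descendants, which would then appear in the variable twice and be reported as repeats. The file name foo.xml is taken from the example above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">
  <xsl:output method="text"/>

  <xsl:key name="ids" match="*" use="@id"/>

  <xsl:template match="/">
    <!-- One empty element per IDed element, carrying only its @id. -->
    <xsl:variable name="IDed-elements">
      <xsl:for-each select="(/ | document('foo.xml'))//*[@id]">
        <xsl:copy>
          <xsl:copy-of select="@id"/>
        </xsl:copy>
      </xsl:for-each>
    </xsl:variable>

    <!-- Convert the fragment to a node set (assumed EXSLT support),
         then look for any id that the key returns twice. -->
    <xsl:if test="exsl:node-set($IDed-elements)/*[key('ids', @id)[2]]">
      <xsl:text>repeated IDs found&#10;</xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```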

4.

Removing duplicates

David Carlisle



> can you elaborate your approach using count()?
	

It isn't really mine, but anyway: in the beginning, the general method of getting rid of duplicates (or, equivalently, getting the first item in each group of related items) was to go something like

select="foo[not(.=preceding::foo)]"

i.e. select all foo's that don't have the same value as an earlier one.

This works but has quadratic behaviour in the number of nodes being searched.

Steve Muench had an insight into using xsl:key to improve things. This was popularised by Jeni, who has a good description of it at her site. It's the method I used in my "dynamic" example. First you specify a key; then a node is the first item in a group (i.e. the one you want if you are discarding duplicates) if it is the first node in the node set returned by the key.

The only tricky part is that XSL doesn't have a node identity test. Given a node $x and a node set $Y, how do you tell if $x is in $Y?

There are two basic methods: one uses generate-id() to give strings which you can compare to get a node identity test, the other uses set theory.

count($x | $Y) is the number of nodes in the union of the set $Y and the singleton set $x. This will be equal to count($Y) if $x is already in $Y and equal to count($Y)+1 otherwise.

so...

 select="/records/record/*[count(.|key('x',name())[1])=1]"

selects all element child nodes of record for which count(.|key('x',name())[1])=1 (rather than 2), i.e. it selects those nodes that are in the set key('x',name())[1], i.e. the first node of each group. Subsequent duplicates are not selected.
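
The key declaration itself is not shown above. A sketch of a complete stylesheet follows, assuming the key 'x' indexes the children of record by element name and that the result is a plain-text list of those names; both are illustrative guesses.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Assumed key: group the children of record by element name. -->
  <xsl:key name="x" match="/records/record/*" use="name()"/>

  <xsl:template match="/">
    <!-- Set-theory test: the union of the current node and the first
         node of its group has one member only when they are the same
         node. The generate-id() form of the same test would be
         generate-id() = generate-id(key('x', name())[1]). -->
    <xsl:for-each
        select="/records/record/*[count(. | key('x', name())[1]) = 1]">
      <xsl:value-of select="name()"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```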

5.

Sort by number of occurrences and remove duplicates.

Mukul Gandhi


> I have an xml doc that is the result of a keyword
> search, which list the
> keyword, and all pages by id that have that instance
> of the keyword. The
> output looks like this.

> <?xml version="1.0" encoding="UTF-8"?>
> <docroot>
>  <token>
>  <pageid>1</pageid>
>  <pageid>3</pageid>
>  <pageid>84</pageid>
>  </token>
>  <token>
>  <pageid>3</pageid>
>  <pageid>5</pageid>
>  <pageid>84</pageid>
>  </token>
>  <token>financial aid
>  <pageid>5</pageid>
>  <pageid>84</pageid>
>  </token>
> </docroot>

> I need to transform this into a grouped and
> sorted list of page
> id's, so that the instances of pageid that occur the
> most frequently are
> first, and so that each pageid is listed only once.
> The above xml would
> then look like this.

> <docroot>
>  <pageid>84</pageid>
>  <pageid>3</pageid>
>  <pageid>5</pageid>
>  <pageid>1</pageid>
> </docroot>

Please try the XSL below --

We need to use the Muenchian method for grouping. I have used the xalan:nodeset extension function.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xalan="http://xml.apache.org/xalan"
    exclude-result-prefixes="xalan">
  <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

  <!-- Index every pageid by its value -->
  <xsl:key name="x" match="docroot/token/pageid" use="."/>

  <xsl:template match="/">
    <!-- Pass 1: one page entry per distinct pageid, with its count -->
    <xsl:variable name="treefrag">
      <docroot>
        <xsl:for-each select="docroot/token/pageid">
          <xsl:if test="generate-id(.) = generate-id(key('x', .)[1])">
            <page>
              <pageid>
                <xsl:value-of select="."/>
              </pageid>
              <no_of_times>
                <xsl:value-of select="count(key('x', .))"/>
              </no_of_times>
            </page>
          </xsl:if>
        </xsl:for-each>
      </docroot>
    </xsl:variable>

    <xsl:call-template name="process_tree">
      <xsl:with-param name="tree" select="xalan:nodeset($treefrag)"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Pass 2: sort the entries by count, most frequent first -->
  <xsl:template name="process_tree">
    <xsl:param name="tree"/>
    <docroot>
      <xsl:for-each select="$tree/docroot/page">
        <xsl:sort select="no_of_times" data-type="number" order="descending"/>
        <pageid>
          <xsl:value-of select="pageid"/>
        </pageid>
      </xsl:for-each>
    </docroot>
  </xsl:template>
</xsl:stylesheet>
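
As a hedged alternative, the same result can probably be had in one pass without an extension function, by sorting the first node of each group directly on the size of its group. This is a sketch of a different arrangement, not the method given above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <xsl:key name="x" match="docroot/token/pageid" use="."/>

  <xsl:template match="/">
    <docroot>
      <!-- Visit only the first pageid of each group of equal values. -->
      <xsl:for-each select="docroot/token/pageid
                            [generate-id() = generate-id(key('x', .)[1])]">
        <!-- Sort those nodes by group size, largest first. -->
        <xsl:sort select="count(key('x', .))" data-type="number"
                  order="descending"/>
        <pageid>
          <xsl:value-of select="."/>
        </pageid>
      </xsl:for-each>
    </docroot>
  </xsl:template>

</xsl:stylesheet>
```

Page ids that occur equally often stay in document order relative to each other, which for the sample input puts 3 before 5.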

6.

List all unique element and attribute names

Mike Kay, Andrew Welch





> I need to list out all elements and attributes (unique) in a
> text file for mapping with another XML file.

> I am able to get all the elements and attributes but I am
> unable to achieve the uniqueness. Can anybody help on this.

In XSLT 2.0 it's simply

distinct-values(//*/name())
distinct-values(//@*/name())

If you really need to do it with XSLT 1.0, eliminating duplicates is essentially the same problem as grouping, and you can use the Muenchian grouping approach.

The preceding-sibling grouping technique isn't going to work (a) because your nodes are not siblings of each other, and (b) because it only works where the grouping key is the string-value of the node, not where it is some other function of the node (here, its name). Muenchian grouping works for any string-valued function of a node.

If you are stuck with 1.0 then you need to use a key:

<xsl:key name="names" match="*|@*" use="name()"/>

with

<xsl:for-each select="//*|//*/@*">
  <xsl:if test="generate-id(.) = generate-id(key('names', name(.))[1])">
    <xsl:value-of select="local-name()"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:if>
</xsl:for-each>

In 2.0 you can use distinct-values() or xsl:for-each-group, but it is always good to learn this technique.
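
For comparison, a minimal XSLT 2.0 sketch of the xsl:for-each-group route. It assumes an XSLT 2.0 processor and a plain-text result with one name per line; elements and attributes that share a name fall into the same group.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- One group per distinct name; the body runs once per group. -->
    <xsl:for-each-group select="//* | //@*" group-by="name()">
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```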