XSLT: Personal Name Sorting Conundrum

OCLC Online Computer Library Center, Inc.
Andrew Houghton
urn:uuid:BCCCD5D0-FEAF-4674-BD03-1ACA49308BCA
2005-03-06
2005-03-06
2005-06-06
2005-03-03
2005
1.0.1
application/xhtml+xml
en-US
Text
Publicly accessible
lcsh
Computer programmers
Computer software developers
ddc
006.74
lcc
QA76.7-QA76.73
lcsh
XPath (Computer program language)
XSLT (Computer program language)
http://www.w3.org/TR/2001/REC-xhtml11-20010531/xhtml11.html
http://www.w3.org/TR/1999/REC-CSS1-19990111
http://www.w3.org/TR/2004/REC-xml-20040204/
http://www.w3.org/TR/1999/REC-xml-names-19990114/
http://staff.oclc.org/~houghtoa/repository/articles/BCCCD5D0-FEAF-4674-BD03-1ACA49308BCA
http://staff.oclc.org/~houghtoa/repository/articles/XsltPersonalNameSortingConundrum.1.0.1/
http://staff.oclc.org/~houghtoa/repository/articles/XsltPersonalNameSortingConundrum/
© Copyright 2005 OCLC Online Computer Library Center, Inc.  All rights reserved. 

Table of Contents

  1. Introduction
  2. Analysis
  3. Solution
  4. Summation
  5. References

Introduction

A question was sent to the XML4LIB listserv by John Fitzgibbon from the Galway Public Library.  He was attempting to sort an XHTML document containing personal names with an XSLT script.  The personal names consisted of either a given name followed by a family name or just a family name.  The personal names were enclosed in a span tag that contained a class attribute with the value person.

His initial XHTML document looked something like:

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE html
  PUBLIC  "-//W3C//DTD XHTML 1.1//EN"
          "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"
>
<html xml:lang='en' xmlns='http://www.w3.org/1999/xhtml'>
  <head>
    <title>Personal Names</title>
    <meta http-equiv='Content-Type' content='application/xhtml+xml; charset=utf-8'/>
  </head>
  <body>
    <span class='person'>Jones</span>
    <span class='person'>Mick Jagger</span>
    <span class='person'>Smith</span>
    <span class='person'>John Fitzgibbon</span>
    <span class='person'>Adams</span>
    <span class='person'>John Smith</span>
    <span class='person'>Fred Fitzgibbon</span>
  </body>
</html>

His initial XSLT script looked something like:

<?xml version='1.0' encoding='utf-8'?>
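<xsl:transform version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
               xmlns:xhtml='http://www.w3.org/1999/xhtml'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template name='Root' match='/'>
    <xsl:value-of select="string('&#10;')"/>
    <!-- XPath 1.0 matches unprefixed names only in no namespace, so the XHTML elements need a prefix -->
    <xsl:for-each select="/xhtml:html/xhtml:body//xhtml:span[@class='person']">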
      <xsl:sort select="substring-after(.,' ')"/>
      <xsl:sort select="."/> 
      <xsl:sort select="substring-before(.,' ')"/>
      <xsl:choose>
        <xsl:when test="contains(.,' ')">
          <xsl:variable name="first" select="substring-before(.,' ')"/>
          <xsl:variable name="last"  select="substring-after(.,' ')"/>
          <xsl:value-of select="concat($last,', ',$first,'&#10;')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="concat(.,'&#10;')"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

Running his XSLT script against his XHTML document produced these results:

Adams
Jones
Smith
Fitzgibbon, Fred
Fitzgibbon, John
Jagger, Mick
Smith, John

The problem he encountered was: all the personal names consisting of only a family name were sorted as a group and appeared first in the alphabetical sequence.  This was followed by a sorted group of the personal names that consisted of both a given and family name. 

The desired result he was trying to achieve was to have a single list of alphabetical personal names rather than the two alphabetical groups of personal names:

Adams
Fitzgibbon, Fred
Fitzgibbon, John
Jagger, Mick
Jones
Smith
Smith, John

Analysis

So why did he get the results sorted as two groups of alphabetic personal names?  To answer this question we must look at his XSLT script and digest the xsl:sort tags:

<xsl:sort select="substring-after(.,' ')"/>
<xsl:sort select="."/> 
<xsl:sort select="substring-before(.,' ')"/>

According to section 10 Sorting in the XSLT specification, the first xsl:sort tag is used as the primary sort key and each subsequent xsl:sort tag is used as an additional sort key.  The first xsl:sort tag will sort on the string that comes after the first breaking space in the personal name.  For personal names that have both a given and family name, the sort will be on the family name, since the personal names are given in direct order.  However, what happens for personal names that only have a family name?

The following table lists the personal names and their associated sort keys:

Personal Name    Sort Key 1  Sort Key 2       Sort Key 3
Jones            (empty)     Jones            (empty)
Mick Jagger      Jagger      Mick Jagger      Mick
Smith            (empty)     Smith            (empty)
John Fitzgibbon  Fitzgibbon  John Fitzgibbon  John
Adams            (empty)     Adams            (empty)
John Smith       Smith       John Smith       John
Fred Fitzgibbon  Fitzgibbon  Fred Fitzgibbon  Fred

According to the XPath specification, the substring-after function will return an empty string when it does not find the test string.  The test string, a breaking space, will not be found in personal names that contain only a family name, so the sort key value will evaluate to an empty string.  When sort keys contain the same value, the data will be grouped together in input order.  This explains why his results appeared as two groups of data:

  1. personal names containing only a family name
  2. personal names containing both the given and family name
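
The sort key values can be checked directly with a small diagnostic stylesheet.  The following is a minimal sketch, not part of the original script: it prints the three sort keys for each personal name, wrapping each key in square brackets so that an empty string is visible as [].  The xhtml prefix is an assumption of this sketch; it is bound to the namespace that the sample XHTML document declares as its default.

<?xml version='1.0' encoding='utf-8'?>
<xsl:transform version='1.0'
               xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
               xmlns:xhtml='http://www.w3.org/1999/xhtml'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template match='/'>
    <xsl:for-each select="//xhtml:span[@class='person']">
      <!-- Name, then sort keys 1, 2 and 3 in the order of the original xsl:sort tags -->
      <xsl:value-of select="concat(., ': [', substring-after(.,' '), '] [', ., '] [', substring-before(.,' '), ']&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

For a personal name such as Jones the first and third keys should print as [], while Mick Jagger should print [Jagger] [Mick Jagger] [Mick].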

The second xsl:sort tag will sort the results by the entire personal name.  However, the second xsl:sort tag is only used when the first xsl:sort tag contains sort keys that are identical.  The first xsl:sort tag generated an empty string for all personal names that only contain family names.  Therefore, the second xsl:sort tag will sort the first group of personal names into alphabetical order.  In addition, we could have identical sort keys for personal names that contain both a given and family name, e.g., same family name different given names.  Under these circumstances the second xsl:sort tag will sort them into alphabetical order because the personal names are given in direct order, e.g., given name followed by family name. 

Finally, the third xsl:sort tag will sort the results by the string that comes before the first breaking space in the personal name.  However, the third xsl:sort tag is only used when the second xsl:sort tag contains sort keys that are identical.  For personal names that contain only a family name, it is possible that there could be duplicate family names during the second xsl:sort tag operation.  In this situation, the third xsl:sort tag will try to sort on the string that comes before the first breaking space in the personal name.  According to the XPath specification, the substring-before function will return an empty string when it does not find the test string.  This behavior will cause an empty string to be returned for all personal names that contain only a family name and will not cause a change in the sort order.

For personal names that contain both a given and family name, it is also possible that there could be identical keys during the second xsl:sort tag operation.  In this situation, the third xsl:sort tag will sort these by the given name.  However, you will note that the second xsl:sort tag operation sorted the personal names by the entire string, which is in direct order, e.g., given name first followed by family name.  Therefore, the third xsl:sort tag will not cause a change in the sort order in this case either.

The following table lists the personal names in their original order followed by their order after each sort key has been applied.  Read down the columns rather than across the rows:

Original Order   Sort Order 1     Sort Order 2     Sort Order 3
Jones            Jones            Adams            Adams
Mick Jagger      Smith            Jones            Jones
Smith            Adams            Smith            Smith
John Fitzgibbon  John Fitzgibbon  Fred Fitzgibbon  Fred Fitzgibbon
Adams            Fred Fitzgibbon  John Fitzgibbon  John Fitzgibbon
John Smith       Mick Jagger      Mick Jagger      Mick Jagger
Fred Fitzgibbon  John Smith       John Smith       John Smith

From the above table we can see that the first sort key grouped all the personal names that contained only family names separately from all the personal names that contained both a given and family name.  The second sort key alphabetically sorted the personal names in each group.  The third sort key was ineffective on the personal names and did not change the results produced by the second sort key.  So how can we correct the XSLT script to produce the desired results? 

Solution

Before we devise a solution, let's first take an inventory of the situation to ensure we understand the problem:

According to the above list we need two sort keys: one for the family name and the other for the given name.  Because the personal name string contains both the given name and family name separated by a single breaking space character, we will need to eliminate the unnecessary name part for both sort keys.  The most difficult sort key to construct will be the primary sort key for the family name, where we must account for an optional given name part.  This will be somewhat complicated because section 10 Sorting in the XSLT specification says that xsl:sort tags must occur as the first elements when used in an xsl:for-each tag.  Given that constraint in the XSLT specification, we will be unable to create temporary variables, with the xsl:variable tag, or call templates, with the xsl:call-template tag, to manipulate the personal name string and extract the appropriate name part for the sort key.  For example, we might have generated the following XSLT variables to hold the given and family names:

<!-- Variable to hold given name -->
<xsl:variable name="name.given">
  <xsl:choose>
    <xsl:when test="contains(.,' ')">
      <xsl:value-of select="substring-before(.,' ')"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="string('')"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>

<!-- Variable to hold family name -->
<xsl:variable name="name.family">
  <xsl:choose>
    <xsl:when test="contains(.,' ')">
      <xsl:value-of select="substring-after(.,' ')"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:variable>

After considering the above issues, the following sort keys were constructed to solve the problem:

<xsl:sort select="concat(substring-after(.,' '),self::*[not(contains(.,' '))])"/>
<xsl:sort select="substring-before(.,' ')"/>

The first xsl:sort tag provides the primary sort key string for the family name part of the personal name.  This sort key is complex since we need to take into account that the given name may not be present in the personal name string.  To accomplish our goal for this sort key we will rely upon the mathematical notion of a commutative operation.  In general terms, an operation is considered commutative when the same result is achieved regardless of the order of its operands.  Mathematical addition is commutative, e.g., 2 + 4 = 4 + 2.  Since we are dealing with strings, rather than numbers, we will use the concatenation operation.  String concatenation is only commutative under certain circumstances, such as an empty string concatenated with another string or a string concatenated with itself.  We will use the commutative operation of an empty string concatenated with another string to generate the value for our primary sort key.
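
The commutative property can be tried out with string literals.  The following is a small sketch written for illustration only; it does not read anything from the input document, so it can be run against any XML file.  The first comparison should print true, because one argument is an empty string; the second should print false, because both arguments are non-empty and different.

<?xml version='1.0' encoding='utf-8'?>
<xsl:transform version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template match='/'>
    <!-- An empty string concatenated with another string: argument order does not matter -->
    <xsl:value-of select="concat('', 'Jones') = concat('Jones', '')"/>
    <xsl:text>&#10;</xsl:text>
    <!-- Two different non-empty strings: argument order does matter -->
    <xsl:value-of select="concat('Mick', 'Jagger') = concat('Jagger', 'Mick')"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:transform>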

Remember, we have two conditions to consider when generating the family name string from the personal name string: either the given name is present or it is not.  Each condition will be represented as an argument to the string concatenation operation.  The first argument, substring-after(.,' '), to the XPath specification's concat function will generate the family name when a given name was present or it will generate an empty string.  According to the XPath specification, the substring-after function will return an empty string when it does not find the test string.  The test string, a breaking space, will only be found when a given name is present.

The second argument, self::*[not(contains(.,' '))], to the XPath specification's concat function will generate the family name when a given name was not present or it will generate an empty string.  The XPath expression self::*[not(contains(.,' '))] selects the current node, self::*, only when it does not contain a breaking space, [not(contains(.,' '))]; the result is therefore either a node-set containing that single node or an empty node-set.  Because the XPath specification's concat function takes strings as arguments, node-sets are implicitly converted to strings as if the XPath specification's string function was performed.  According to the XPath specification's string function, empty node-sets will be converted to empty strings.
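
The node-set behavior described above can be observed with another diagnostic sketch, again not part of the original script.  For each personal name it prints how many nodes the expression selects, using the count function, followed by the string that the node-set converts to.  The xhtml prefix is an assumption of this sketch, bound to the default namespace of the sample XHTML document.

<?xml version='1.0' encoding='utf-8'?>
<xsl:transform version='1.0'
               xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
               xmlns:xhtml='http://www.w3.org/1999/xhtml'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template match='/'>
    <xsl:for-each select="//xhtml:span[@class='person']">
      <!-- count() reports 1 or 0 nodes; concat() converts the node-set itself to a string -->
      <xsl:value-of select="concat(., ': nodes=', count(self::*[not(contains(.,' '))]), ' string=[', self::*[not(contains(.,' '))], ']&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

A name such as Jones should report nodes=1 string=[Jones], while Mick Jagger should report nodes=0 string=[].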

The astute person will realize that the two arguments to the XPath specification's concat function are opposing conditions: for any personal name, exactly one of them is non-empty.  These opposing conditions work in conjunction with the commutative property of the concat function when an empty string is concatenated with another string.  We could have changed the order of the arguments to the XPath specification's concat function and achieved the same results.  The following table illustrates this point:

Personal Name    Concat Arg 1  Concat Arg 2  Concat Result
Jones            (empty)       Jones         Jones
Mick Jagger      Jagger        (empty)       Jagger
Smith            (empty)       Smith         Smith
John Fitzgibbon  Fitzgibbon    (empty)       Fitzgibbon
Adams            (empty)       Adams         Adams
John Smith       Smith         (empty)       Smith
Fred Fitzgibbon  Fitzgibbon    (empty)       Fitzgibbon
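
The claim that the argument order does not matter for this data can be checked per personal name.  The sketch below is an illustration only; the variable names a and b are hypothetical, and the xhtml prefix is an assumption bound to the default namespace of the sample XHTML document.  It compares the primary sort key built in both argument orders and should print true for every name.

<?xml version='1.0' encoding='utf-8'?>
<xsl:transform version='1.0'
               xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
               xmlns:xhtml='http://www.w3.org/1999/xhtml'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template match='/'>
    <xsl:for-each select="//xhtml:span[@class='person']">
      <!-- a: family name when a given name is present, otherwise an empty string -->
      <xsl:variable name='a' select="substring-after(.,' ')"/>
      <!-- b: family name when no given name is present, otherwise an empty string -->
      <xsl:variable name='b' select="string(self::*[not(contains(.,' '))])"/>
      <xsl:value-of select="concat(., ': ', concat($a,$b) = concat($b,$a), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

The variables are allowed here because this xsl:for-each tag contains no xsl:sort tags; the sketch only inspects the key values and does not sort.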

The second xsl:sort tag provides the secondary sort key string for the given name part of the personal name.  This sort key will only be used when the first xsl:sort tag contains identical sort keys for the family name.  We will use the XPath specification's substring-before function to extract the given name from the personal name.  According to the XPath specification, the substring-before function will return an empty string when it does not find the test string.  The test string, a breaking space, will only be found when a given name is present.  Since an empty string will be returned when a personal name does not contain a given name, those personal names will logically sort before personal names that do contain a given name.  For example, we have the personal names Smith and John Smith.  The personal name Smith will sort before the personal name John Smith.  This satisfies our requirement to secondarily order all personal names that contain a given name, by the given name, and order them after any personal names that do not contain a given name, but have the same family name.

The following XSLT script provides the complete working solution to the original problem:

<?xml version='1.0' encoding='utf-8'?>
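<xsl:transform version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
               xmlns:xhtml='http://www.w3.org/1999/xhtml'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template name='Root' match='/'>
    <xsl:value-of select="string('&#10;')"/>
    <!-- XPath 1.0 matches unprefixed names only in no namespace, so the XHTML elements need a prefix -->
    <xsl:for-each select="/xhtml:html/xhtml:body//xhtml:span[@class='person']">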
      <xsl:sort select="concat(substring-after(.,' '),self::*[not(contains(.,' '))])"/>
      <xsl:sort select="substring-before(.,' ')"/>
      <xsl:choose>
        <xsl:when test="contains(.,' ')">
          <xsl:variable name="given"  select="substring-before(.,' ')"/>
          <xsl:variable name="family" select="substring-after(.,' ')"/>
          <xsl:value-of select="concat($family,', ',$given,'&#10;')"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="concat(.,'&#10;')"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>
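
For comparison, the primary sort key can also be built without the concat function.  The following sketch is an alternative that is not taken from the original discussion: it uses the XPath substring function to skip the given name and the space, relying on the number function converting the boolean result of contains to 1 or 0.  The xhtml prefix is an assumption bound to the default namespace of the sample XHTML document, and the names are printed in direct order to keep the sketch short.

<?xml version='1.0' encoding='utf-8'?>
<xsl:transform version='1.0'
               xmlns:xsl='http://www.w3.org/1999/XSL/Transform'
               xmlns:xhtml='http://www.w3.org/1999/xhtml'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template match='/'>
    <xsl:for-each select="//xhtml:span[@class='person']">
      <!-- Start position: length of the given name, plus 1, plus 1 more when a space is present -->
      <xsl:sort select="substring(., string-length(substring-before(.,' ')) + 1 + number(contains(.,' ')))"/>
      <xsl:sort select="substring-before(.,' ')"/>
      <xsl:value-of select="concat(.,'&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:transform>

For Jones the start position is 0 + 1 + 0 = 1, so the whole string is the key; for John Fitzgibbon it is 4 + 1 + 1 = 6, so the key is Fitzgibbon.  The sort order should therefore match the one produced by the concat based key.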

Summation

A solution was provided for John Fitzgibbon's original problem.  However, it should be pointed out that the solution is specific to his problem and not a general solution to sorting personal names, which could occur in many different forms.  The solution demonstrates XSLT and XPath techniques through script snippets.  These techniques and script snippets may be useful to others and should be considered to be in the public domain by readers of this document.  Questions and comments should be directed to me through my staff page in the references below.
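
One form that the solution does not handle is a personal name with more than one breaking space.  The sketch below illustrates this with a hypothetical name, Mary Ann Smith, which is not part of the original data.  Because the sort keys split on the first breaking space, the primary key should print as [Ann Smith] rather than [Smith], and the secondary key as [Mary].

<?xml version='1.0' encoding='utf-8'?>
<xsl:transform version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text' encoding='utf-8' media-type='text/plain'/>
  <xsl:template match='/'>
    <!-- Hypothetical personal name with a middle name, used only for illustration -->
    <xsl:variable name='name' select="'Mary Ann Smith'"/>
    <!-- The name contains a space, so the primary key reduces to substring-after -->
    <xsl:value-of select="concat('primary key: [', substring-after($name,' '), ']&#10;')"/>
    <xsl:value-of select="concat('secondary key: [', substring-before($name,' '), ']&#10;')"/>
  </xsl:template>
</xsl:transform>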

References