Carol Jean Godby and Jay Stuler

OCLC Online Computer Library Center, Inc., Dublin, Ohio, USA

 

 

The Library of Congress Classification as a Knowledge Base for Automatic Subject Categorization

 

Abstract.  This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification.   A high degree of concept integrity was obtained when subject headings were mapped from OCLC’s WorldCat database and filtered using the log-likelihood statistic.

 

 

  1. Introduction 

      We are interested in using the Library of Congress Classification (LCC) to classify Web resources and other full-text documents.  Our research project has three goals:

 

·        To adapt the LCC for use as a knowledge base for automatically classifying full text.

·        To exploit the LCC's structure for online subject-oriented browsing.

·        To make the results of our work freely available to the library community.

 

Our interest is consistent with OCLC's six-month-old Global Product Strategy, whose cornerstone is a proposal to make OCLC's WorldCat database the database of choice for Web users wishing to access the high-quality resources that are held in or described by the world's libraries.  We agree with Chan (2000) that the adaptation of library classification schemes for Internet resources would permit semantic interoperability between the large store of MARC records and the new resources, and would obviate the need to re-invent or re-discover the principles of classification.

However, the LCC presents some well-known obstacles for projects like ours.  The complete set of LCC schedules has nearly a quarter of a million sparse class definitions, whose literary warrant is derived primarily from books published in the United States.  Moreover, the class notation favors hospitality over economy or consistency and the schedules are not designed for automatic processing. 

Nevertheless, many researchers believe that these obstacles can be overcome.  The landmark study of Larson (1992) evaluated the suitability of the LCC as a knowledge base for the automatic classification of bibliographic records.  Working with Schedule Z, Bibliography and Library Science, he concluded that, once the schedule was simplified and additional subject headings were imported, the resulting knowledge base could be used to cluster bibliographic records by subject, a possible first step in the creation of a fully classified record.  More recently, several digital library projects that recognize the utility of the LCC as a classification scheme for Internet resources are exploring ways to adapt the LCC for this new application.  For example, The Columbia University Digital Libraries Project[1] created browsable hierarchical displays of the LCC, enhanced the subject terminology in consultation with the LCC editors, and linked some of their Web resources to it.  CyberStacks[2] and The WWW Virtual Library[3] show the results of similar efforts, though their collections of resources are more narrowly focused by subject.

Our work shares the high-level goals of these projects.  The visible result is a demonstration using a fragment of the LCC to organize and access Internet and other full-text resources. 

 

 

2. Adapting the LCC schedules for automatic classification

Because our project requires that documents be classified automatically, we build on Larson's research and extend its applicability to full-text documents.  We selected Schedules Q, Science; R, Medicine; S, Agriculture and T, Technology for a pilot study that would shed light on how the LCC should be adapted to accomplish this goal.  The topics represented in these schedules provide reasonably broad coverage for a prototype system but present a good test of the system’s ability to discriminate among similar subjects.

Previous research in document categorization suggests that a large number of sparsely populated classes present too many spurious targets for an automatic classifier.  Accordingly, about 85% of the classes in Schedules Q, R, S and T were deleted using simple heuristics.  All classes whose hierarchy contains cross-references to other LCC schedules or that mention geographic names and names of genres were eliminated.  The effect of these reductions is shown in the fragment from Schedule S in Figure 1.  Only the first class listed, SD426-SD428.22, is retained.  The result is that the essential topic is identified but often at the cost of a flattened LCC hierarchy and a class definition represented as a range of LCC numbers.  We made these simplifications for several reasons.  First, we believe that topical assignment is the most pressing need for an automatic classifier designed for large collections of Web documents whose subject matter is unknown.  Second, many of the genre names that result in a proliferation of LCC class definitions may be irrelevant to descriptions of Web resources that don’t fit very well into taxonomies of traditionally published material.  Finally, in applications where the information omitted from our system is important, we believe it can be added, either algorithmically in some cases or with human intervention in a computer-aided classification system.

 

SD426-SD428.22 Forestry.  Conservation and protection.  Forest reserves.

SD426.A1-SD426.Z Forestry.  Conservation and protection.  Forest reserves.  United States.  General.

SD426.A1-SD426.A5 Forestry.  Conservation and protection.  Forest reserves.  United States.  General.  Documents.

SD426.A5 Forestry.  Conservation and protection.  Forest reserves.  United States.  General.  Documents.  By date

SD426.A6-SD426.Z  Forestry.  Conservation and protection.  Forest reserves.  United States.  General.  General works.

 

Figure 1.  A fragment from Schedule S showing an enumeration of genres.
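The pruning heuristics can be illustrated with a minimal sketch.  The term lists, the LccClass record and the has_cross_reference flag are assumptions made for illustration; the actual lists and file layout used in the project are not given here.

```python
from dataclasses import dataclass

# Hypothetical stop lists, chosen only to cover the fragment in Figure 1.
GEOGRAPHIC_TERMS = {"united states"}
GENRE_TERMS = {"general", "general works", "documents", "by date"}


@dataclass(frozen=True)
class LccClass:
    number_range: str                  # e.g. "SD426-SD428.22"
    caption: str                       # full hierarchy, levels separated by "."
    has_cross_reference: bool = False  # assumed flag for "see" references


def caption_levels(caption):
    """Split a hierarchy caption into lower-cased, trimmed levels."""
    return [level.strip().lower() for level in caption.split(".") if level.strip()]


def keep_class(lcc_class):
    """Keep a class only if it has no cross-reference and no level of its
    hierarchy is a geographic name or a genre name."""
    if lcc_class.has_cross_reference:
        return False
    return not any(
        level in GEOGRAPHIC_TERMS or level in GENRE_TERMS
        for level in caption_levels(lcc_class.caption)
    )


FOREST = "Forestry.  Conservation and protection.  Forest reserves."
FIGURE_1 = [
    LccClass("SD426-SD428.22", FOREST),
    LccClass("SD426.A1-SD426.Z", FOREST + "  United States.  General."),
    LccClass("SD426.A1-SD426.A5", FOREST + "  United States.  General.  Documents."),
    LccClass("SD426.A5", FOREST + "  United States.  General.  Documents.  By date"),
    LccClass("SD426.A6-SD426.Z", FOREST + "  United States.  General.  General works."),
]

retained = [c.number_range for c in FIGURE_1 if keep_class(c)]
print(retained)
```

Applied to the five classes of Figure 1, the sketch retains only SD426-SD428.22.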

 

The definitions of the remaining classes were enhanced using two sources of terminology, the Library of Congress Subject Authority (LCSA) File and OCLC's WorldCat database of bibliographic records.  Both sources of terminology have class assignments that are more specific than the modified schedules.  An algorithm that correctly assigns the terminology must identify the LCC class with the most specific range that contains the LCC number listed in the source record, which requires numeric as well as alphabetic comparisons.  For example, the LCSA record that assigns the headings iron ores and manganese ores to QE390.2.I76 is mapped to QE390.2.A-QE390.2.Z Geology. Mineralogy. Special groups of minerals. Ore minerals. Special ore minerals A-Z, to which platinum ores, nickel ores, rare earths, and antimony ores have also been applied using our processes.  The algorithm rejects an assignment to the more generic class QE389-QE390.5 Geology. Mineralogy. Special groups of minerals, which contains mapped terms such as ores, geochemistry and hydrothermal alteration.
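The range comparison described in this paragraph might be sketched as follows.  The parsing pattern, the function names and the third class range in the example are assumptions for illustration; the sketch handles only numbers made up of class letters, a decimal number and at most one Cutter, and it assumes that class ranges nest cleanly.

```python
import re
from decimal import Decimal

# Class letters, a decimal class number, and an optional Cutter such as "I76".
LCC_PATTERN = re.compile(r"^([A-Z]{1,3})(\d+(?:\.\d+)?)(?:\.([A-Z]\d*))?$")


def sort_key(number, upper=False):
    """Turn an LCC number into a tuple that sorts in shelf order.

    When `upper` is true, a missing Cutter letter or missing Cutter digits
    are padded with "~", which sorts after letters and digits, so that an
    upper bound such as QE390.5 or QE390.2.Z covers everything beneath it.
    """
    match = LCC_PATTERN.match(number.strip())
    if match is None:
        raise ValueError(f"unrecognized LCC number: {number!r}")
    letters, digits, cutter = match.groups()
    cutter_letter = cutter[0] if cutter else ""
    cutter_digits = cutter[1:] if cutter else ""
    if upper:
        cutter_letter = cutter_letter or "~"
        cutter_digits = cutter_digits or "~"
    # Cutter digits are decimal fractions, so they are compared as strings.
    return (letters, Decimal(digits), cutter_letter, cutter_digits)


def range_bounds(class_range):
    """Split a range such as "QE389-QE390.5" into (lower_key, upper_key)."""
    parts = re.split(r"\s*[-–]\s*", class_range.strip())
    return sort_key(parts[0]), sort_key(parts[-1], upper=True)


def most_specific_class(number, class_ranges):
    """Return the narrowest range containing `number`, or None."""
    target = sort_key(number)
    containing = []
    for class_range in class_ranges:
        low, high = range_bounds(class_range)
        if low <= target <= high:
            containing.append((low, high, class_range))
    if not containing:
        return None
    # With nested ranges, the innermost one has the greatest lower bound
    # and, among ties, the smallest upper bound.
    best_low = max(low for low, _, _ in containing)
    innermost = [c for c in containing if c[0] == best_low]
    return min(innermost, key=lambda c: c[1])[2]


# The first two ranges come from the text; the third is hypothetical.
CLASSES = ["QE389-QE390.5", "QE390.2.A-QE390.2.Z", "QE391-QE399"]
print(most_specific_class("QE390.2.I76", CLASSES))
```

For the example in the text, the sketch selects QE390.2.A-QE390.2.Z over the broader QE389-QE390.5, although both ranges contain QE390.2.I76.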

This procedure is sufficient for mapping terminology from LCSA records but the subject headings obtained from WorldCat must undergo an additional processing step.  While the class assignments in the LCSA file were created for the purpose of improving the LCC as a classification scheme, the WorldCat headings are only indirectly relevant to this goal, since their presence in a bibliographic record serves primarily to enhance subject access to the item being described.  To simulate the editorial task of defining an LCC class by assigning a subject heading to it, we need to identify the most stable headings—the headings that most commonly appear with a given range of LCC numbers and do not appear elsewhere in the WorldCat sample. 

Figure 2 illustrates one result.  It shows an LCC class from Schedule R that has been enhanced with data from the LCSA file and from bibliographic records in a 140,000-record sample of WorldCat that contain class assignments from Schedule R paired with Library of Congress subject headings.  The class defines a timely and popular subject, with the result that approximately 4000 subject headings from the WorldCat sample can be mapped to the LC class numbers in the range RJ370-RJ520.  Since 4000 headings are unwieldy and possibly misleading, the subject-heading/class-number pairs are filtered using the log-likelihood statistic (Dunning 1993), which measures the strength of pairwise associations and is commonly used in information retrieval and computational linguistics research.  The highly associated terms, such as attention-deficit hyperactivity disorder and autism in children, are highly relevant to the definition of the class and are only rarely paired with other LCC class numbers.  Conversely, headings such as adjustment (psychology) and first aid are given low association scores by the log-likelihood measure.  Though they may legitimately refer to a facet of a given work about diseases of children, they are only tangentially related to the meaning of the class in Figure 2 because they are paired with a large number of other LCC class assignments in the WorldCat database.

 

Class number:

RJ370-RJ520

 

Hierarchy:

     Pediatrics.

          Diseases of children.

     

High associations:

Chronically ill children, mentally ill children, epilepsy in children, speech therapy for children, attention-deficit hyperactivity disorder, autism in children, pediatric neurology, rheumatoid arthritis in children, chronic diseases in children

 

Low associations:

Medical emergencies, child health services, stress (psychology), resuscitation, telephone, children, adjustment (psychology), first aid, disabled, family therapy, diseases, risk, physical fitness, child nutrition, infants (newborn), youth, cerebral palsy, nervous system

 

LCSA headings:

Adolescent psychopathology scale, aps (adolescent psychiatry), psychiatric rating scales, electroconvulsive therapy for children, electroconvulsive therapy for teenagers, violence in children, child psychopathology, children and violence, violence in adolescence, adolescent psychopathology, fragile x syndrome, syndromes, x-linked mental retardation

 

Figure 2.  An enhanced LCC record from Schedule R Medicine.
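For readers unfamiliar with the log-likelihood statistic, the following sketch computes it for one subject-heading/class-number pair from a 2x2 contingency table, in the G-squared form commonly used for Dunning's (1993) measure.  The function name and the counts in the example are hypothetical and are not taken from the WorldCat sample.

```python
import math


def log_likelihood(k11, k12, k21, k22):
    """G-squared statistic for a 2x2 contingency table.

    k11: pairs with this heading and this class
    k12: pairs with this heading and any other class
    k21: pairs with any other heading and this class
    k22: pairs with any other heading and any other class
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    cells = (
        (k11, row_sums[0], col_sums[0]),
        (k12, row_sums[0], col_sums[1]),
        (k21, row_sums[1], col_sums[0]),
        (k22, row_sums[1], col_sums[1]),
    )
    score = 0.0
    for observed, row, col in cells:
        if observed > 0:  # an empty cell contributes nothing
            score += observed * math.log(observed * total / (row * col))
    return 2.0 * score


# Hypothetical counts: a heading concentrated in one class ...
strong = log_likelihood(50, 2, 3, 1000)
# ... and a heading spread across classes in proportion to their size.
weak = log_likelihood(5, 195, 45, 1755)
print(strong, weak)
```

The statistic is zero when a heading is distributed independently of the class and grows as the observed counts depart from that expectation.  Because it also grows when a heading is unusually rare in a class, a filter built on it would presumably also check that the observed count exceeds the expected one.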

 

 

3. An evaluation

The major outcome of this project is a database design that can be used for the automatic classification of full-text documents.  Each record is an LCC class that has been enhanced with subject headings from OCLC’s WorldCat database and the Library of Congress Subject Authority File. 

The database was constructed from three sources: machine-readable files of the LCC Schedules Q, R, S and T obtained from the Library of Congress; a machine-readable version of the Library of Congress Subject Authority File; and a subset from WorldCat, extracted in April 2000, which contains every bibliographic record that has a Dewey Decimal Classification (DDC) as well as an LCC assignment.  Though the DDC assignment is irrelevant to the present study, this criterion produces a large but manageable sample of records with a broad distribution of subjects.  Most of the records are also indexed with Library of Congress subject headings.  The pairings of these headings with LCC numbers constitute the raw data for this study.  Table 1 shows their distribution for Schedules Q, R, S and T.  From the WorldCat sample, 865,881 pairs of class numbers and subject headings were generated.  By contrast, Larson (1992), who reported guarded conclusions about the utility of the LCC as an automatic classifier, conducted his tests with only 30,000 pairs.

 

LCC Schedule        WorldCat records    LCC class/subject heading pairs
_______________________________________________________________________
Q Science                    326,391                            637,470
R Medicine                   139,280                            374,465
S Agriculture                 83,818                            154,400
T Technology                 314,357                            269,546
_______________________________________________________________________
Total                        863,846                            865,881

 

Table 1.  Subject heading assignments in a WorldCat sample.

 

            Using the heuristics for simplifying the LCC schedules described in the previous section, we eliminated 91% of the classes, with the result that 6314 potential concept definitions have been extracted from the four schedules.  Table 2 shows the distribution of the two classes of subject headings after they have been mapped to the simplified LCC schedules. 

The raw data generate two hypotheses that motivate the present study.  First, the editorial mappings from the LCSA file, though highly trustworthy, are inadequate for creating the class definitions required for an automatic classifier because they populate only 9% of the classes in the simplified schedules, compared to 86% for headings obtained from WorldCat.  Second, the extremely large standard deviations in the distributions of the mappings from WorldCat records present problems for an automatic classifier and suggest that these data must be further processed.  The second half of Table 2 shows the results when the terminology obtained from WorldCat is trimmed to the mean, the 120 most highly associated pairs of subject headings and LCC class assignments as measured by the log-likelihood statistic.  Logically, this change has the effect of consolidating the locations of the mapped terms in the LCC class structure and reducing the mean, standard deviation and maximum to values close to those found when the Dewey Decimal Classification is converted to a database for automatic classification (Thompson, et al. 1997).

 

Source of headings:                                       WorldCat               LCSA

Q, R, S and T
    Total                                        5,460/6,313 (86%)     618/6,313 (9%)
    Mean                                                       120                 13
    Standard deviation                                         477                 18
    Largest number of headings in a single record           12,320              1,036

Q, R, S and T (filtered)
    Total                                              5,460/6,313               Same
    Mean                                                        36
    Standard deviation                                          45
    Largest number of headings in a single record              120

 

Table 2.  Mappings of two sources of subject headings to the LCC.      
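The trimming step summarized in the second half of Table 2 can be sketched as a per-class cut-off on the association score.  The function names, the input layout and the use of the population standard deviation are assumptions made for illustration, and the scores in the example are invented.

```python
import statistics


def keep_top_headings(scored_headings, limit=120):
    """Keep at most `limit` headings per class, highest score first.

    `scored_headings` maps an LCC class range to a list of
    (heading, association_score) pairs.
    """
    trimmed = {}
    for lcc_class, pairs in scored_headings.items():
        ranked = sorted(pairs, key=lambda pair: pair[1], reverse=True)
        trimmed[lcc_class] = [heading for heading, _ in ranked[:limit]]
    return trimmed


def heading_count_summary(headings_by_class):
    """Mean, standard deviation and maximum of headings per class."""
    counts = [len(headings) for headings in headings_by_class.values()]
    return {
        "mean": statistics.mean(counts),
        "stdev": statistics.pstdev(counts),  # population form, an assumption
        "max": max(counts),
    }


# Invented scores for two classes, trimmed to two headings each.
example = {
    "RJ370-RJ520": [("autism in children", 5.0), ("first aid", 1.0),
                    ("epilepsy in children", 3.0)],
    "QE390.2.A-QE390.2.Z": [("iron ores", 2.0)],
}
trimmed = keep_top_headings(example, limit=2)
print(trimmed)
print(heading_count_summary(trimmed))
```

With a limit of 120, no class can hold more than 120 headings, which is consistent with the maximum reported for the filtered data in Table 2.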

                                                                                                           

The database design was evaluated with the so-called self-match test that was used by Thompson, et al. (1997) to test the suitability of the Dewey Decimal Classification as a knowledge base for automatic classification.  Databases were created using OCLC’s Scorpion[4] software and indexed with the subject headings found in the LCC hierarchy for each record, as well as those applied from the LCSA file and the WorldCat sample.  The baseline database contained terms only from the first two sources, while two additional test databases included Library of Congress subject headings: one that had every heading found in the WorldCat sample, and one that contained only the most highly associated terms for each class, as summarized in Table 2.  All three databases contained 6314 records corresponding to the classes in the set of truncated LCC schedules.  To test the databases, a set of queries was generated that consisted of the same 6314 records.  These were run individually against each database to generate result sets of twenty items that could be interpreted as ranked lists of LCC class assignments.

A self-match occurs if the query retrieves itself as the highest ranked result, a test which measures the concept integrity of the database and hence the classification scheme from which it is derived.  Accordingly, if the subject headings that define the LCC classes are randomly scattered throughout the database, the query would fail to retrieve itself as the top-ranked record because many other records would be highly similar.  But if the terminology is distributed to form meaningful, distinct concepts, the number of self-matches should be high.  Figure 3 shows a fragment of a sample run, with a successful self-match for the LCC class illustrated in Figure 2.

 

Query: RJ370-RJ520  Pediatrics.  Diseases of children.

Ranked results:

  1. RJ370-RJ520  Pediatrics.  Diseases of children.
  2. RJ499-RJ507  Pediatrics.  Child psychiatry.  Diseases of children.  Mental disorders of children and adolescents.  Child psychiatry.  Child mental health services.
  3. RJ506.A-RJ506.Z  Pediatrics.  Diseases of children.  Mental disorders of children and adolescents. Child psychiatry.  Child mental health services.  Specific disorders A-Z.
  4. RJ503.7.A-RJ503.7.Z  Pediatrics.  Diseases of children.  Mental disorders of children and adolescents. Child psychiatry.  Child mental health services.  Examination.  Assessment.  Diagnosis.
  5. RJ496.A-RJ496.Z  Pediatrics.  Diseases of children.  Diseases of the nervous system.  By disease, A-Z.

 

Figure 3.   A sample result from a self-match experiment.
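The logic of the self-match test can be sketched independently of the retrieval software.  The ranking function used by Scorpion is not described here, so the sketch substitutes a deliberately simple one: an unnormalized count of shared terms, with ties counted against the query.  The function name, the term sets and the third class label are hypothetical.

```python
from collections import Counter


def self_match_ranks(database, result_size=20):
    """Tally the rank at which each record retrieves itself.

    `database` maps a class range to the set of terms that define it.
    Each record is used as a query against the whole database.  A rank
    beyond `result_size` is tallied under the key None.
    """
    ranks = Counter()
    for query_id, query_terms in database.items():
        own_score = len(query_terms)  # overlap of the record with itself
        rivals = sum(
            1
            for other_id, other_terms in database.items()
            if other_id != query_id
            and len(query_terms & other_terms) >= own_score
        )
        rank = 1 + rivals
        ranks[rank if rank <= result_size else None] += 1
    return ranks


# Illustrative term sets; "GENERIC" is a hypothetical class defined only
# by a term that the other two classes also contain.
database = {
    "RJ370-RJ520": {"pediatrics", "diseases of children", "autism in children"},
    "RJ499-RJ507": {"pediatrics", "diseases of children", "child psychiatry"},
    "GENERIC": {"pediatrics"},
}
ranks = self_match_ranks(database)
print(ranks)
```

Under these assumptions, a class defined by at least one distinctive term retrieves itself first, while a class defined only by widely shared terms is outranked by its neighbors, which is the effect the test is designed to expose.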

 

Table 3 is a summary of the results from the self-match test on all three databases.  The difference in first-ranked self-matches between the baseline and the database with filtered WorldCat terms is over 22 percentage points.  This difference suggests that subject headings from an external source can be used to define distinct concepts, and are, perhaps, necessary for converting the LCC into a knowledge base for a useful automatic classification system.  However, the results from the database with the unfiltered WorldCat headings suggest that the headings must be properly processed because they can also introduce noise that severely degrades performance.

                                                                                                                  

Rank    Baseline         With WorldCat headings    With filtered WorldCat headings
_________________________________________________________________________________
1       4,325 (68.5%)    1,725 (27.3%)             5,755 (91.1%)
2         910 (14.4%)    1,659 (26.3%)               161 (2.50%)
3         246 (3.89%)    1,243 (19.7%)                24 (0.38%)
4         137 (2.17%)      817 (12.9%)                19 (0.30%)
5          71 (1.12%)      277 (4.38%)                12 (0.19%)

 

Table 3. The top five results for the self-match test on three databases.
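The first-rank percentages in Table 3 follow from the counts and the 6314 queries run against each database, as the short calculation below shows.  The variable names are illustrative.

```python
TOTAL_QUERIES = 6314

# First-ranked self-matches from Table 3.
first_rank = {"baseline": 4325, "unfiltered": 1725, "filtered": 5755}

rates = {name: count / TOTAL_QUERIES for name, count in first_rank.items()}
gain = rates["filtered"] - rates["baseline"]

for name, rate in rates.items():
    print(f"{name}: {100 * rate:.1f}%")
print(f"filtered minus baseline: {100 * gain:.1f} percentage points")
```

The gain of the filtered database over the baseline is the difference between the two first-rank rates.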

 

 

4. Future work 

            The system we have described uses a library classification scheme, the Library of Congress Classification, to automatically classify full-text documents.  Our experiments to date have focused on adaptations and enhancements that are possible with a set of subject headings and class assignments produced by catalogers who create bibliographic records, but the same model is extensible, in principle, to terminology that has been automatically extracted from documents that have been automatically classified.  Depending on the application, resources that are successfully mapped to a record such as the one shown in Figure 2 may or may not be considered fully classified.  If the goal is a collection of high-quality bibliographic records, an automatic classification that results in an assignment to a range of LCC class numbers would probably have to be made more specific with a human-mediated process.  But if the goal is subject access to a collection of resources where none existed before, the links to well-established subject hierarchies and subject headings made possible by an automatic classifier based on the LCC may be valuable raw material for a resource that facilitates searching and browsing by subject.

            Our project Web page[5] can be monitored for further progress reports and demonstrations.

 

 

References

 

Chan, Lois Mai (2000).  Exploiting LCSH, LCC and DDC to Retrieve Networked Resources: Issues and Challenges.  Paper presented at the Conference on Bibliographic Control in the New Millennium, Library of Congress, November 2000.  Accessible at:

<http://lcweb.loc.gov/catdir/bibcontrol/chan_paper.html>

 

Dunning, Ted (1993).  Accurate methods for the statistics of surprise and coincidence.  Computational Linguistics. Vol. 19, No. 1, 61-74.

 

Larson, Ray (1992).  Experiments in Automatic Library of Congress Classification.  Journal of the American Society for Information Science 43(2):130-148.

 

Thompson, Roger; Shafer, Keith; and Vizine-Goetz, Diane (1997).  Evaluating Dewey Concepts as a Knowledge Base for Automatic Subject Assignment.  Paper presented at the first ACM Digital Libraries Workshop, January 1997.  Accessible at: <http://orc.rsch.oclc.org:6109/eval_dc.html>

 

 

Notes



[1] Accessible at: <http://www.columbia.edu/cu/libraries/digital/>

 

[2] Accessible at: <http://www.public.iastate.edu/~CYBERSTACKS/>

 

[3] Accessible at: <http://vlib.org/>

 

[4] Accessible at: <http://orc.rsch.oclc.org:6109/>

 

[5] Accessible at: <http://staff.oclc.org/~godby/auto_class/auto.html>