OCLC Online Computer Library Center, Inc; Dublin, Ohio, USA
The Library of Congress Classification as a Knowledge Base for Automatic Subject Categorization
Abstract. This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC’s WorldCat database and filtered using the log-likelihood statistic.
We are interested in using the Library of Congress Classification (LCC) to classify Web resources and other full-text documents. Our research project has three goals:
· To adapt the LCC for use as a knowledge base for automatically classifying full text.
· To exploit the LCC's structure for online subject-oriented browsing.
· To make the results of our work freely available to the library community.
Our interest is consistent with OCLC's six-month-old Global Product Strategy, whose cornerstone is a proposal to make OCLC's WorldCat database the database of choice for Web users wishing to access the high-quality resources that are held in or described by the world's libraries. We agree with Chan (2000) that the adaptation of library classification schemes for Internet resources would permit semantic interoperability between the large store of MARC records and the new resources, and would obviate the need to re-invent or re-discover the principles of classification.
However, the LCC presents some well-known obstacles for projects like ours. The complete set of LCC schedules has nearly a quarter of a million sparse class definitions, whose literary warrant is derived primarily from books published in the United States. Moreover, the class notation favors hospitality over economy or consistency and the schedules are not designed for automatic processing.
Nevertheless, many researchers believe that these obstacles can be overcome. The landmark study of Larson (1992) evaluated the suitability of the LCC as a knowledge base for the automatic classification of bibliographic records. Working with Schedule Z, Bibliography and Library Science, he concluded that, once the schedule was simplified and additional subject headings were imported, the resulting knowledge base could be used to cluster bibliographic records by subject, a possible first step in the creation of a fully classified record. More recently, several digital library projects that recognize the utility of the LCC as a classification scheme for Internet resources are exploring ways to adapt the LCC for this new application. For example, The Columbia University Digital Libraries Project created browsable hierarchical displays of the LCC, enhanced the subject terminology in consultation with the LCC editors, and linked some of their Web resources to it. CyberStacks and The WWW Virtual Library show the results of similar efforts, though their collections of resources are more narrowly focused by subject.
Our work shares the high-level goals of these projects. The visible result is a demonstration using a fragment of the LCC to organize and access Internet and other full-text resources.
Previous research in document categorization suggests that a large number of sparsely populated classes present too many spurious targets for an automatic classifier. Accordingly, about 85% of the classes in Schedules Q, R, S and T were deleted using simple heuristics. All classes whose hierarchy contains cross-references to other LCC schedules or that mention geographic names and names of genres were eliminated. The effect of these reductions is shown in the fragment from Schedule S in Figure 1. Only the class shown in italics, SD426-SD428.22, is retained. The result is that the essential topic is identified but often at the cost of a flattened LCC hierarchy and a class definition represented as a range of LCC numbers. We made these simplifications for several reasons. First, we believe that topical assignment is the most pressing need for an automatic classifier designed for large collections of Web documents whose subject matter is unknown. Second, many of the genre names that result in a proliferation of LCC class definitions may be irrelevant to descriptions of Web resources that don’t fit very well into taxonomies of traditionally published material. Finally, in applications where the information omitted from our system is important, we believe it can be added—algorithmically, in some cases, or with human intervention in a computer-aided classification system.
SD426-SD428.22 Forestry. Conservation and protection. Forest reserves.
SD426.A1-SD426.Z Forestry. Conservation and protection. Forest reserves. United States. General.
SD426.A1-SD426.A5 Forestry. Conservation and protection. Forest reserves. United States. General. Documents.
SD426.A5 Forestry. Conservation and protection. Forest reserves. United States. General. Documents. By date
SD426.A6-SD426.Z Forestry. Conservation and protection. Forest reserves. United States. General. General works.
Figure 1. A fragment from Schedule S showing an enumeration of genres.
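The pruning heuristics can be illustrated with a minimal Python sketch applied to the captions in Figure 1. It is not the code used in the project: the function name keep_class, the has_cross_reference flag, and the two word lists are hypothetical, and the lists hold only the names that appear in Figure 1. Splitting captions on periods is a simplification that suits this fragment.

```python
# Hypothetical word lists; the lists used in the project are not given in the text.
GEOGRAPHIC_NAMES = {"United States"}
GENRE_NAMES = {"Documents", "General works"}

def keep_class(caption, has_cross_reference=False):
    """Return True if a class survives the pruning heuristics."""
    if has_cross_reference:
        # Classes that cross-reference another schedule are dropped.
        return False
    # Treat each period-delimited element of the caption as one hierarchy level.
    parts = {part.strip() for part in caption.split(".") if part.strip()}
    return not (parts & (GEOGRAPHIC_NAMES | GENRE_NAMES))

prefix = "Forestry. Conservation and protection. Forest reserves."
figure_1 = {
    "SD426-SD428.22": prefix,
    "SD426.A1-SD426.Z": prefix + " United States. General.",
    "SD426.A1-SD426.A5": prefix + " United States. General. Documents.",
    "SD426.A5": prefix + " United States. General. Documents. By date",
    "SD426.A6-SD426.Z": prefix + " United States. General. General works.",
}

retained = [number for number, caption in figure_1.items() if keep_class(caption)]
print(retained)
```

Under these assumptions only SD426-SD428.22 is retained, which matches the outcome described for Figure 1.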
The definitions of the remaining classes were enhanced using two sources of terminology, the Library of Congress Subject Authority (LCSA) File and OCLC's WorldCat database of bibliographic records. Both sources of terminology have class assignments that are more specific than the modified schedules. An algorithm that correctly assigns the terminology must identify the LCC class with the most specific range that contains the LCC number listed in the source record, which requires numeric as well as alphabetic comparisons. For example, the LCSA record that assigns the headings iron ores and manganese ores to QE390.2.I76 is mapped to QE390.2.A-QE390.2.Z Geology. Mineralogy. Special groups of minerals. Ore minerals. Special ore minerals A-Z, to which platinum ores, nickel ores, rare earths, and antimony ores have also been applied using our processes. The algorithm rejects an assignment to the more generic class QE389-QE390.5 Geology. Mineralogy. Special groups of minerals, which contains mapped terms such as ores, geochemistry and hydrothermal alteration.
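The range comparison described above can be sketched in Python. This is an illustration under stated assumptions, not the project's implementation: the names lcc_key and most_specific_class are hypothetical, the parser accepts only class letters, a number and at most one Cutter, and candidate ranges are assumed to nest inside one another.

```python
import re
from decimal import Decimal

# Class letters, a number with an optional decimal part, and an optional Cutter.
_LCC = re.compile(r"^([A-Z]{1,3})(\d+(?:\.\d+)?)(?:\.([A-Z])(\d*))?$")

def lcc_key(number, upper=False):
    """Turn an LCC number into a tuple that sorts in shelf order."""
    match = _LCC.match(number.replace(" ", ""))
    if match is None:
        raise ValueError(f"unrecognized LCC number: {number!r}")
    letters, num, cutter_letter, cutter_digits = match.groups()
    if cutter_letter is None:
        # No Cutter: a lower bound sorts before every Cutter, an upper bound after.
        cutter_letter, fraction = ("[" if upper else ""), Decimal(0)
    elif cutter_digits == "":
        # A bare letter such as ".Z" spans every Cutter that starts with it.
        fraction = Decimal(1) if upper else Decimal(0)
    else:
        fraction = Decimal("0." + cutter_digits)  # Cutter digits sort as decimals
    return (letters, Decimal(num), cutter_letter, fraction)

def most_specific_class(number, class_ranges):
    """Return the narrowest range in class_ranges that contains number, or None."""
    key = lcc_key(number)
    best = None
    for class_range in class_ranges:
        low, _, high = class_range.partition("-")
        low_key = lcc_key(low)
        high_key = lcc_key(high or low, upper=True)
        if not low_key <= key <= high_key:
            continue
        # Assumes ranges nest, so a narrower range lies inside the best one so far.
        if best is None or (low_key >= best[0] and high_key <= best[1]):
            best = (low_key, high_key, class_range)
    return None if best is None else best[2]

ranges = ["QE389-QE390.5", "QE390.2.A-QE390.2.Z"]
print(most_specific_class("QE390.2.I76", ranges))  # QE390.2.A-QE390.2.Z
```

The class letters compare alphabetically, the class number compares numerically, and the Cutter compares first by letter and then as a decimal fraction, which is why a plain string comparison would not be enough.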
This procedure is sufficient for mapping terminology from LCSA records but the subject headings obtained from WorldCat must undergo an additional processing step. While the class assignments in the LCSA file were created for the purpose of improving the LCC as a classification scheme, the WorldCat headings are only indirectly relevant to this goal, since their presence in a bibliographic record serves primarily to enhance subject access to the item being described. To simulate the editorial task of defining an LCC class by assigning a subject heading to it, we need to identify the most stable headings—the headings that most commonly appear with a given range of LCC numbers and do not appear elsewhere in the WorldCat sample.
Figure 2 illustrates one result. It shows an LCC class from Schedule R that has been enhanced with data from the LCSA file and from bibliographic records in a 140,000-record sample of WorldCat that contain class assignments from Schedule R paired with Library of Congress subject headings. The class defines a timely and popular subject, with the result that approximately 4000 subject headings from the WorldCat sample can be mapped to the LC class numbers in the range RJ370-RJ520. Since 4000 headings are unwieldy and possibly misleading, the subject-heading/class-number pairs are filtered using the log-likelihood statistic (Dunning 1993), which measures the strength of pairwise associations and is commonly used in information retrieval and computational linguistics research. The highly associated terms, such as attention-deficit hyperactivity disorder and autism in children, are highly relevant to the definition of the class and are only rarely paired with other LCC class numbers. Conversely, headings such as adjustment (psychology) and first aid are given low association scores by the log-likelihood measure. Though they may legitimately refer to a facet of a given work about diseases of children, they are only tangentially related to the meaning of the class in Figure 2 because they are paired with a large number of other LCC class assignments in the WorldCat database.
Diseases of children.
Chronically ill children, mentally ill children, epilepsy in children, speech therapy for children, attention-deficit hyperactivity disorder, autism in children, pediatric neurology, rheumatoid arthritis in children, chronic diseases in children
Medical emergencies, child health services, stress (psychology), resuscitation, telephone, children, adjustment (psychology), first aid, disabled, family therapy, diseases, risk, physical fitness, child nutrition, infants (newborn), youth, cerebral palsy, nervous system
Adolescent psychopathology scale, aps (adolescent psychiatry), psychiatric rating scales, electroconvulsive therapy for children, electroconvulsive therapy for teenagers, violence in children, child psychopathology, children and violence, violence in adolescence, adolescent psychopathology, fragile x syndrome, syndromes, x-linked mental retardation
Figure 2. An enhanced LCC record from Schedule R Medicine.
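The log-likelihood filter can be illustrated with a short Python sketch of the G² statistic for a 2x2 contingency table, following Dunning (1993). The function name, the argument names and the counts in the example are hypothetical and are not taken from the WorldCat sample.

```python
import math

def log_likelihood(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 table of heading/class co-occurrence counts.

    k11: records with the heading and the class
    k12: records with the heading and some other class
    k21: records with the class and some other heading
    k22: records with neither
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    total = 0.0
    for i, row in enumerate(((k11, k12), (k21, k22))):
        for j, observed in enumerate(row):
            if observed > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += observed * math.log(observed / expected)
    return 2.0 * total

# Hypothetical counts: a heading concentrated in one class versus a scattered one.
concentrated = log_likelihood(50, 5, 10, 10000)
scattered = log_likelihood(5, 500, 55, 9505)
print(concentrated > scattered)
```

The statistic also grows when a pair occurs less often than chance would predict, so a filter built on it would need a separate check on the direction of the association. The sketch leaves that check out.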
The database was constructed from three sources: machine-readable files of the LCC Schedules Q, R, S and T obtained from the Library of Congress; a machine-readable version of the Library of Congress Subject Authority File; and a subset from WorldCat, extracted in April 2000, which contains every bibliographic record that has a Dewey Decimal Classification (DDC) as well as an LCC assignment. Though the DDC assignment is irrelevant to the present study, this criterion produces a large but manageable sample of records with a broad distribution of subjects. Most of the records are also indexed with Library of Congress subject headings. The pairing of these headings with LCC numbers constitutes the raw data for this study. Table 1 shows their distribution for Schedules Q, R, S and T. From the WorldCat sample, 865,881 pairs of class numbers and subject headings were generated. By contrast, Larson (1992), who reported guarded conclusions about the utility of the LCC as an automatic classifier, conducted his tests with only 30,000 pairs.
LCC Schedule     Records     Subject heading pairs
Q Science        326,391     637,470
R Medicine       139,280     374,465
T Technology     314,357     269,546
Total            863,846     865,881
Table 1. Subject heading assignments in a WorldCat sample.
Using the heuristics for simplifying the LCC schedules described in the previous section, we eliminated 91% of the classes, with the result that 6314 potential concept definitions have been extracted from the four schedules. Table 2 shows the distribution of the two classes of subject headings after they have been mapped to the simplified LCC schedules.
The raw data generate two hypotheses that motivate the present study. First, the editorial mappings from the LCSA file, though highly trustworthy, are inadequate for creating the class definitions required for an automatic classifier because they populate only 9% of the classes in the simplified schedules, compared to 86% for headings obtained from WorldCat. Second, the extremely large standard deviations in the distributions of the mappings from WorldCat records present problems for an automatic classifier and suggest that these data must be further processed. The second half of Table 2 shows the results when the terminology obtained from WorldCat is trimmed to the mean, that is, to the 120 most highly associated pairs of subject headings and LCC class assignments as measured by the log-likelihood statistic. This change has the effect of consolidating the locations of the mapped terms in the LCC class structure and reducing the mean, standard deviation and maximum to values close to those found when the Dewey Decimal Classification is converted to a database for automatic classification (Thompson et al. 1997).
                                                 WorldCat headings    LCSA headings
Total                                            5,460/6,313 (86%)    618/6,313 (9%)
Mean                                             120                  13
Standard deviation                               477                  18
After trimming the WorldCat headings:
Total                                            5,460/6,313          Same
Standard deviation                               45
Largest number of headings in a single record    120
Table 2. Mappings of two sources of subject headings to the LCC.
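The trimming step can be sketched as a per-class top-N selection over scored pairs. This is a minimal illustration in Python: the function name trim_headings and the scores in the example are hypothetical, and only the limit of 120 comes from the text.

```python
import heapq
from collections import defaultdict

def trim_headings(scored_pairs, limit=120):
    """Keep at most `limit` headings per class, highest association score first.

    scored_pairs: iterable of (lcc_class, heading, score) tuples.
    """
    by_class = defaultdict(list)
    for lcc_class, heading, score in scored_pairs:
        by_class[lcc_class].append((score, heading))
    return {
        lcc_class: [heading for _, heading in heapq.nlargest(limit, items)]
        for lcc_class, items in by_class.items()
    }

# Hypothetical scores for three headings mapped to one class.
pairs = [
    ("RJ370-RJ520", "autism in children", 90.0),
    ("RJ370-RJ520", "first aid", 1.5),
    ("RJ370-RJ520", "pediatric neurology", 40.0),
]
print(trim_headings(pairs, limit=2))
```

A fixed limit caps the largest records without changing classes that already hold fewer headings, which is consistent with the reduced standard deviation reported in the second half of Table 2.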
The database design was evaluated with the so-called self-match test that was used by Thompson et al. (1997) to test the suitability of the Dewey Decimal Classification as a knowledge base for automatic classification. Databases were created using OCLC’s Scorpion software and indexed with the subject headings found in the LCC hierarchy for each record, as well as those applied from the LCSA file and the WorldCat sample. The baseline database contained terms only from the first two sources, while two additional test databases included Library of Congress subject headings: one that had every heading found in the WorldCat sample, and one that contained only the most highly associated terms for each class, as summarized in Table 2. All three databases contained 6314 records corresponding to the classes in the set of truncated LCC schedules. To test the databases, a set of queries was generated that consisted of the same 6314 records. These were run individually against each database to generate result sets of twenty items that could be interpreted as ranked lists of LCC class assignments.
A self-match occurs if the query retrieves itself as the highest ranked result, a test which measures the concept integrity of the database and hence the classification scheme from which it is derived. Accordingly, if the subject headings that define the LCC classes are randomly scattered throughout the database, the query would fail to retrieve itself as the top-ranked record because many other records would be highly similar. But if the terminology is distributed to form meaningful, distinct concepts, the number of self-matches should be high. Figure 3 has a fragment of a sample run, showing a successful self-match for the LCC class illustrated in Figure 2.
Query: RJ370-RJ520 Pediatrics. Diseases of children.
Figure 3. A sample result from a self-match experiment.
Rank    Baseline         With WorldCat headings    With filtered WorldCat headings
1       4,325 (68.5%)    1,725 (27.3%)             5,755 (91.1%)
2       910 (14.4%)      1,659 (26.3%)             161 (2.50%)
3       246 (3.89%)      1,243 (19.7%)             24 (0.38%)
4       137 (2.17%)      817 (12.9%)               19 (0.30%)
5       71 (1.12%)       277 (4.38%)               12 (0.19%)
Table 3. The top five results for the self-match test on three databases.
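A tally of the kind reported in Table 3 can be sketched in Python once ranked result lists are available. The retrieval step itself, performed with Scorpion in the experiments, is outside this sketch. The function name self_match_ranks and the toy result lists are hypothetical.

```python
from collections import Counter

def self_match_ranks(results, depth=5):
    """Count how often each query retrieves itself at ranks 1..depth.

    results: dict mapping a query's class identifier to its ranked list
    of retrieved class identifiers, best match first.
    Returns {rank: (count, percent of all queries)}.
    """
    if not results:
        raise ValueError("no results to tally")
    counts = Counter()
    for query_id, ranked in results.items():
        top = ranked[:depth]
        if query_id in top:
            counts[top.index(query_id) + 1] += 1  # ranks are 1-based
    total = len(results)
    return {
        rank: (counts[rank], 100.0 * counts[rank] / total)
        for rank in range(1, depth + 1)
    }

# Toy example: A matches itself at rank 1, B at rank 2, C not at all.
toy = {"A": ["A", "B"], "B": ["A", "B"], "C": ["B", "A"]}
print(self_match_ranks(toy, depth=2))
```

A self-match in the sense used above corresponds to rank 1. The lower ranks show how far a class falls when its definition overlaps with those of other classes.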
Our project Web page can be monitored for further progress reports and demonstrations.
Chan, Lois Mai (2000). Exploiting LCSH, LCC and DDC to Retrieve Networked Resources: Issues and Challenges. Paper presented at the Conference on Bibliographic Control in the New Millennium, Library of Congress, November 2000. Accessible at:
Dunning, Ted (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics. Vol. 19, No. 1, 61-74.
Larson, Ray (1992). Experiments in Automatic Library of Congress Classification. Journal of the American Society for Information Science 43(2):130-148.
Thompson, Roger; Shafer, Keith, and Vizine-Goetz, Diane (1997). Evaluating Dewey Concepts as a Knowledge Base for Automatic Subject Assignment. Paper presented at the first ACM Digital Libraries Workshop, January 1997. Accessible at: http://orc.rsch.oclc.org:6109/eval_dc.html