OCLC Online Computer Library Center, Inc., Dublin, Ohio, USA
The Library of Congress Classification as a Knowledge Base for Automatic Subject Categorization
Abstract. This paper describes a set of experiments in adapting a subset of the Library of Congress Classification for use as a database for automatic classification. A high degree of concept integrity was obtained when subject headings were mapped from OCLC’s WorldCat database and filtered using the log-likelihood statistic.
We are interested in using the Library of Congress Classification (LCC) to classify Web resources and other full-text documents. Our research project has three goals:
· To adapt the LCC for use as a knowledge base for automatically classifying full text.
· To exploit the LCC's structure for online subject-oriented browsing.
· To make the results of our work freely available to the library community.
Our interest is consistent with OCLC's six-month-old Global Product Strategy, whose cornerstone is a proposal to make OCLC's WorldCat database the database of choice for Web users wishing to access the high-quality resources that are held in or described by the world's libraries. We agree with Chan (2000) that the adaptation of library classification schemes for Internet resources would permit semantic interoperability between the large store of MARC records and the new resources, and would obviate the need to re-invent or re-discover the principles of classification.
However, the LCC presents some well-known obstacles for projects like ours. The complete set of LCC schedules has nearly a quarter of a million sparse class definitions, whose literary warrant is derived primarily from books published in the United States. Moreover, the class notation favors hospitality over economy or consistency, and the schedules are not designed for automatic processing.
Nevertheless, many researchers believe that these obstacles can be overcome. The landmark study of Larson (1992) evaluated the suitability of the LCC as a knowledge base for the automatic classification of bibliographic records. Working with Schedule Z, Bibliography and Library Science, he concluded that, once the schedule was simplified and additional subject headings were imported, the resulting knowledge base could be used to cluster bibliographic records by subject, a possible first step in the creation of a fully classified record. More recently, several digital library projects that recognize the utility of the LCC as a classification scheme for Internet resources are exploring ways to adapt the LCC for this new application. For example, the Columbia University Digital Libraries Project[1] created browsable hierarchical displays of the LCC, enhanced the subject terminology in consultation with the LCC editors, and linked some of their Web resources to it. CyberStacks[2] and The WWW Virtual Library[3] show the results of similar efforts, though their collections of resources are more narrowly focused by subject.
Our work shares the high-level goals of these projects. The visible result is a demonstration using a fragment of the LCC to organize and access Internet and other full-text resources.
Previous research in document categorization suggests that a large number of sparsely populated classes present too many spurious targets for an automatic classifier. Accordingly, about 85% of the classes in Schedules Q, R, S and T were deleted using simple heuristics. All classes whose hierarchy contains cross-references to other LCC schedules or that mention geographic names and names of genres were eliminated. The effect of these reductions is shown in the fragment from Schedule S in Figure 1. Only the class shown in italics, SD426-SD428.22, is retained. The result is that the essential topic is identified but often at the cost of a flattened LCC hierarchy and a class definition represented as a range of LCC numbers. We made these simplifications for several reasons. First, we believe that topical assignment is the most pressing need for an automatic classifier designed for large collections of Web documents whose subject matter is unknown. Second, many of the genre names that result in a proliferation of LCC class definitions may be irrelevant to descriptions of Web resources that don’t fit very well into taxonomies of traditionally published material. Finally, in applications where the information omitted from our system is important, we believe it can be added—algorithmically, in some cases, or with human intervention in a computer-aided classification system.
SD426-SD428.22 Forestry. Conservation and protection. Forest reserves.
SD426.A1-SD426.Z Forestry. Conservation and protection. Forest reserves. United States. General.
SD426.A1-SD426.A5 Forestry. Conservation and protection. Forest reserves. United States. General. Documents.
SD426.A5 Forestry. Conservation and protection. Forest reserves. United States. General. Documents. By date
SD426.A6-SD426.Z Forestry. Conservation and protection. Forest reserves. United States. General. General works.
Figure 1. A fragment from Schedule S showing an enumeration of genres.
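The pruning heuristic applied to Figure 1 can be sketched as follows. This is a minimal illustration, not the procedure actually used: the term lists GENRE_TERMS and GEOGRAPHIC_TERMS are assumptions chosen only to cover the captions in Figure 1, the function names are hypothetical, and the check for cross-references to other schedules is omitted.

```python
# Illustrative term lists (assumptions); a real run would need far larger lists.
GENRE_TERMS = {"general", "general works", "documents", "by date"}
GEOGRAPHIC_TERMS = {"united states"}


def is_retained(caption):
    """Keep a class only if no element of its hierarchy names a genre or a place."""
    parts = [part.strip().lower() for part in caption.split(".") if part.strip()]
    return not any(part in GENRE_TERMS or part in GEOGRAPHIC_TERMS for part in parts)


def simplify(schedule):
    """Return the subset of a {class number: hierarchy caption} mapping that survives."""
    return {number: caption for number, caption in schedule.items() if is_retained(caption)}


# The five classes shown in Figure 1.
FIGURE_1 = {
    "SD426-SD428.22": "Forestry. Conservation and protection. Forest reserves.",
    "SD426.A1-SD426.Z": "Forestry. Conservation and protection. Forest reserves. "
                        "United States. General.",
    "SD426.A1-SD426.A5": "Forestry. Conservation and protection. Forest reserves. "
                         "United States. General. Documents.",
    "SD426.A5": "Forestry. Conservation and protection. Forest reserves. "
                "United States. General. Documents. By date",
    "SD426.A6-SD426.Z": "Forestry. Conservation and protection. Forest reserves. "
                        "United States. General. General works.",
}

retained = simplify(FIGURE_1)
```

With these assumed term lists, only SD426-SD428.22 survives, which matches the outcome described for Figure 1.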
The definitions of the remaining classes were enhanced using two sources of terminology, the Library of Congress Subject Authority (LCSA) File and OCLC's WorldCat database of bibliographic records. Both sources of terminology have class assignments that are more specific than the modified schedules. An algorithm that correctly assigns the terminology must identify the LCC class with the most specific range that contains the LCC number listed in the source record, which requires numeric as well as alphabetic comparisons. For example, the LCSA record that assigns the headings iron ores and manganese ores to QE390.2.I76 is mapped to QE390.2.A-QE390.2.Z Geology. Mineralogy. Special groups of minerals. Ore minerals. Special ore minerals A-Z, to which platinum ores, nickel ores, rare earths, and antimony ores have also been applied using our processes. The algorithm rejects an assignment to the more generic class QE389-QE390.5 Geology. Mineralogy. Special groups of minerals, which contains mapped terms such as ores, geochemistry and hydrothermal alteration.
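The range lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the parser handles only class letters, a decimal number and one optional Cutter, Cutter digits are compared as a decimal fraction, and a range bound without a Cutter is assumed to cover every Cutter under that number.

```python
import re
from decimal import Decimal

# Class letters, class number, and an optional single Cutter (letter plus digits).
_LCC = re.compile(r"^([A-Z]{1,3})(\d+(?:\.\d+)?)(?:\.([A-Z])(\d*))?$")


def lcc_key(number, upper=False):
    """Turn an LCC number into a tuple that sorts alphabetically, then numerically."""
    match = _LCC.match(number.replace(" ", ""))
    if match is None:
        raise ValueError(f"unsupported LCC number: {number!r}")
    letters, digits, cutter_letter, cutter_digits = match.groups()
    if cutter_letter is None:
        # An upper bound without a Cutter is taken to include every Cutter (assumption).
        cutter = ("~", Decimal(1)) if upper else ("", Decimal(0))
    elif cutter_digits:
        cutter = (cutter_letter, Decimal("0." + cutter_digits))
    else:
        # A bare letter such as "Z" as an upper bound includes Z1 through Z999...
        cutter = (cutter_letter, Decimal(1) if upper else Decimal(0))
    return (letters, Decimal(digits), *cutter)


def most_specific_class(number, ranges):
    """Return the narrowest range label containing `number`, or None if none does."""
    point = lcc_key(number)
    best = None
    for label in ranges:
        low_text, _, high_text = label.replace(" ", "").partition("-")
        low = lcc_key(low_text)
        high = lcc_key(high_text or low_text, upper=True)
        if not (low <= point <= high):
            continue
        # A range nested inside the current best is more specific.
        if best is None or (low >= best[0] and high <= best[1]):
            best = (low, high, label)
    return None if best is None else best[2]


RANGES = ["QE389-QE390.5", "QE390.2.A-QE390.2.Z"]
chosen = most_specific_class("QE390.2.I76", RANGES)
```

For the example in the text, QE390.2.I76 falls inside both ranges, and the nested range QE390.2.A-QE390.2.Z is preferred over the more generic QE389-QE390.5.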
This procedure is sufficient for mapping terminology from LCSA records but the subject headings obtained from WorldCat must undergo an additional processing step. While the class assignments in the LCSA file were created for the purpose of improving the LCC as a classification scheme, the WorldCat headings are only indirectly relevant to this goal, since their presence in a bibliographic record serves primarily to enhance subject access to the item being described. To simulate the editorial task of defining an LCC class by assigning a subject heading to it, we need to identify the most stable headings—the headings that most commonly appear with a given range of LCC numbers and do not appear elsewhere in the WorldCat sample.
Figure 2 illustrates one result. It shows an LCC class from Schedule R that has been enhanced with data from the LCSA file and from bibliographic records in a 140,000-record sample of WorldCat that contain class assignments from Schedule R paired with Library of Congress subject headings. The class defines a timely and popular subject, with the result that approximately 4000 subject headings from the WorldCat sample can be mapped to the LC class numbers in the range RJ370-RJ520. Since 4000 headings are unwieldy and possibly misleading, the subject-heading/class-number pairs are filtered using the log-likelihood statistic (Dunning 1993), which measures the strength of pairwise associations and is commonly used in information retrieval and computational linguistics research. The highly associated terms, such as attention-deficit hyperactivity disorder and autism in children, are highly relevant to the definition of the class and are only rarely paired with other LCC class numbers. Conversely, headings such as adjustment (psychology) and first aid are given low association scores by the log-likelihood measure. Though they may legitimately refer to a facet of a given work about diseases of children, they are only tangentially related to the meaning of the class in Figure 2 because they are paired with a large number of other LCC class assignments in the WorldCat database.
Class number: RJ370-RJ520
Hierarchy: Pediatrics. Diseases of children.
High associations: Chronically ill children, mentally ill children, epilepsy in children, speech therapy for children, attention-deficit hyperactivity disorder, autism in children, pediatric neurology, rheumatoid arthritis in children, chronic diseases in children
Low associations: Medical emergencies, child health services, stress (psychology), resuscitation, telephone, children, adjustment (psychology), first aid, disabled, family therapy, diseases, risk, physical fitness, child nutrition, infants (newborn), youth, cerebral palsy, nervous system
LCSA headings: Adolescent psychopathology scale, aps (adolescent psychiatry), psychiatric rating scales, electroconvulsive therapy for children, electroconvulsive therapy for teenagers, violence in children, child psychopathology, children and violence, violence in adolescence, adolescent psychopathology, fragile x syndrome, syndromes, x-linked mental retardation
Figure 2. An enhanced LCC record from Schedule R Medicine.
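The log-likelihood filter can be sketched as follows, using the G-squared form of the statistic over a 2x2 contingency table for each heading/class pair. This is a minimal illustration: the function names and the toy data are assumptions (the classes X1 and X2 and the headings h3 and h4 are invented), and the paper does not say how pairs that co-occur less often than expected were treated. Such pairs can also receive a high score, so a production filter would probably need to check the direction of the association as well.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """G-squared for a 2x2 table: heading/class, heading/other, other/class, other/other."""
    total = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = ((k11, row1, col1), (k12, row1, col2), (k21, row2, col1), (k22, row2, col2))
    g2 = 0.0
    for observed, row, col in cells:
        if observed > 0:  # an empty cell contributes nothing to the sum
            g2 += observed * math.log(observed * total / (row * col))
    return 2.0 * g2


def score_pairs(pairs):
    """Score every distinct (heading, class) pair in a list of observed pairs."""
    pair_counts = Counter(pairs)
    heading_counts = Counter(heading for heading, _ in pairs)
    class_counts = Counter(lcc_class for _, lcc_class in pairs)
    total = len(pairs)
    scores = {}
    for (heading, lcc_class), k11 in pair_counts.items():
        k12 = heading_counts[heading] - k11   # heading paired with other classes
        k21 = class_counts[lcc_class] - k11   # class paired with other headings
        k22 = total - k11 - k12 - k21         # everything else
        scores[(heading, lcc_class)] = log_likelihood(k11, k12, k21, k22)
    return scores


# Invented toy sample: one heading concentrated in a class, one spread across classes.
TOY_PAIRS = (
    [("autism in children", "RJ370-RJ520")] * 8
    + [("first aid", "RJ370-RJ520")] * 2
    + [("first aid", "X1")] * 4
    + [("first aid", "X2")] * 4
    + [("h3", "X1")] * 6
    + [("h4", "X2")] * 6
)
toy_scores = score_pairs(TOY_PAIRS)
```

In this toy sample the concentrated heading scores far above the scattered one for RJ370-RJ520, which mirrors the contrast between the high and low associations in Figure 2.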
The database was constructed from three sources: machine-readable files of the LCC Schedules Q, R, S and T obtained from the Library of Congress; a machine-readable version of the Library of Congress Subject Authority File; and a subset from WorldCat, extracted in April 2000, which contains every bibliographic record that has a Dewey Decimal Classification (DDC) as well as an LCC assignment. Though the DDC assignment is irrelevant to the present study, this criterion produces a large but manageable sample of records with a broad distribution of subjects. Most of the records are also indexed with Library of Congress subject headings. The pairings of these headings with LCC numbers constitute the raw data for this study. Table 1 shows their distribution for Schedules Q, R, S and T. From the WorldCat sample, 865,881 pairs of class numbers and subject headings were generated. By contrast, Larson (1992), who reported guarded conclusions about the utility of the LCC as an automatic classifier, conducted his tests with only 30,000 pairs.
LCC Schedule      Records      Subject heading pairs
_______________________________________________________
Q Science         326,391      637,470
R Medicine        139,280      374,465
T Technology      314,357      269,546
Total             863,846      865,881
Table 1. Subject heading assignments in a WorldCat sample.
Using the heuristics for simplifying the LCC schedules described in the previous section, we eliminated 91% of the classes, with the result that 6314 potential concept definitions have been extracted from the four schedules. Table 2 shows the distribution of the two classes of subject headings after they have been mapped to the simplified LCC schedules.
The raw data suggest two hypotheses that motivate the present study. First, the editorial mappings from the LCSA file, though highly trustworthy, are inadequate for creating the class definitions required for an automatic classifier because they populate only 9% of the classes in the simplified schedules, compared to 86% for headings obtained from WorldCat. Second, the extremely large standard deviations in the distributions of the mappings from WorldCat records present problems for an automatic classifier and suggest that these data must be further processed. The second half of Table 2 shows the results when the terminology obtained from WorldCat is trimmed to the mean, that is, to the 120 most highly associated pairs of subject headings and LCC class assignments as measured by the log-likelihood statistic. This change has the effect of consolidating the locations of the mapped terms in the LCC class structure and reducing the mean, standard deviation and maximum to values close to those found when the Dewey Decimal Classification is converted to a database for automatic classification (Thompson et al. 1997).
                                                 WorldCat headings     LCSA headings
Before trimming
Total                                            5,460/6,313 (86%)     618/6,313 (9%)
Mean                                             120                   13
Standard deviation                               477                   18
Largest number of headings in a single record    12,320                1,036
After trimming the WorldCat headings
Total                                            5,460/6,313           Same
Mean                                             36
Standard deviation                               45
Largest number of headings in a single record    120
Table 2. Mappings of two sources of subject headings to the LCC.
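The trimming step summarized in the second half of Table 2 can be sketched as follows. This is a minimal illustration that assumes association scores are already available as a mapping from (heading, class) pairs to numbers; the function names, the tie-breaking rule and the toy scores are assumptions, not part of the original procedure.

```python
from collections import defaultdict
from statistics import mean, pstdev


def trim_headings(scores, limit=120):
    """Keep, for each class, at most `limit` headings with the highest scores."""
    by_class = defaultdict(list)
    for (heading, lcc_class), score in scores.items():
        by_class[lcc_class].append((score, heading))
    trimmed = {}
    for lcc_class, scored in by_class.items():
        # Highest score first; ties are broken alphabetically (an arbitrary choice).
        scored.sort(key=lambda item: (-item[0], item[1]))
        trimmed[lcc_class] = [heading for _, heading in scored[:limit]]
    return trimmed


def summarize(trimmed):
    """Report the statistics tabulated in Table 2 for a trimmed mapping."""
    sizes = [len(headings) for headings in trimmed.values()]
    return {"mean": mean(sizes), "stdev": pstdev(sizes), "max": max(sizes)}


# Invented toy scores for two hypothetical classes, C1 and C2.
TOY_SCORES = {("a", "C1"): 5.0, ("b", "C1"): 9.0, ("c", "C1"): 1.0, ("d", "C2"): 2.0}
toy_trimmed = trim_headings(TOY_SCORES, limit=2)
toy_summary = summarize(toy_trimmed)
```

Because the cap applies per class, classes with few headings are untouched while heavily populated classes shrink to the limit, which is why the maximum in the trimmed half of Table 2 equals the cap of 120.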
The database design was evaluated with the so-called self-match test that was used by Thompson, et al. (1997) to test the suitability of the Dewey Decimal Classification as a knowledge base for automatic classification. Databases were created using OCLC’s Scorpion[4] software and indexed with the subject headings found in the LCC hierarchy for each record, as well as those applied from the LCSA file and the WorldCat sample. The baseline database contained terms only from the first two sources, while two additional test databases included Library of Congress subject headings: one that had every heading found in the WorldCat sample, and one that contained only the most highly associated terms for each class, as summarized in Table 2. All three databases contained 6314 records corresponding to the classes in the set of truncated LCC schedules. To test the databases, a set of queries was generated that consisted of the same 6314 records. These were run individually against each database to generate result sets of twenty items that could be interpreted as ranked lists of LCC class assignments.
A self-match occurs if the query retrieves itself as the highest ranked result, a test which measures the concept integrity of the database and hence the classification scheme from which it is derived. Accordingly, if the subject headings that define the LCC classes are randomly scattered throughout the database, the query would fail to retrieve itself as the top-ranked record because many other records would be highly similar. But if the terminology is distributed to form meaningful, distinct concepts, the number of self-matches should be high. Figure 3 has a fragment of a sample run, showing a successful self-match for the LCC class illustrated in Figure 2.
Query: RJ370-RJ520 Pediatrics. Diseases of children.
Ranked results:
Figure 3. A sample result from a self-match experiment.
Rank    Baseline          With WorldCat headings    With filtered WorldCat headings
1       4,325 (68.5%)     1,725 (27.3%)             5,755 (91.1%)
2       910 (14.4%)       1,659 (26.3%)             161 (2.50%)
3       246 (3.89%)       1,243 (19.7%)             24 (0.38%)
4       137 (2.17%)       817 (12.9%)               19 (0.30%)
5       71 (1.12%)        277 (4.38%)               12 (0.19%)
Table 3. The top five results for the self-match test on three databases.
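The tally behind a table such as Table 3 can be sketched as follows. This is only an illustration of the self-match idea: Scorpion's ranking function is not described here, so a plain count of shared terms stands in for it, and a record's rank is taken pessimistically as one plus the number of other records that score at least as high. The function names and the toy database (including the classes CLASS-A and CLASS-B and their terms) are assumptions.

```python
from collections import Counter


def overlap(query_terms, record_terms):
    """Stand-in similarity: the number of query terms that the record also contains."""
    return len(set(query_terms) & set(record_terms))


def self_match_rank(lcc_class, database):
    """Rank of a record when it is used as a query against its own database."""
    own_terms = database[lcc_class]
    own_score = overlap(own_terms, own_terms)
    rivals = sum(
        1
        for other, terms in database.items()
        if other != lcc_class and overlap(own_terms, terms) >= own_score
    )
    return rivals + 1


def self_match_table(database, top=5):
    """Count and percentage of records whose self-match falls at each rank up to `top`."""
    ranks = Counter(self_match_rank(lcc_class, database) for lcc_class in database)
    total = len(database)
    return {rank: (ranks[rank], 100.0 * ranks[rank] / total) for rank in range(1, top + 1)}


# Invented toy database: CLASS-A shares all of its terms with RJ370-RJ520.
TOY_DATABASE = {
    "RJ370-RJ520": {"pediatrics", "diseases of children", "autism in children"},
    "CLASS-A": {"pediatrics", "diseases of children"},
    "CLASS-B": {"forestry", "forest reserves"},
}
toy_table = self_match_table(TOY_DATABASE)
```

In the toy database, CLASS-A fails the self-match because every one of its terms also appears in another record, which is the kind of overlap that filtering the WorldCat headings is meant to reduce.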
Our project Web page[5] can be monitored for further progress reports and demonstrations.
References
Chan, Lois Mai (2000). Exploiting LCSH, LCC and DDC to Retrieve Networked Resources: Issues and Challenges. Paper presented at the Conference on Bibliographic Control in the New Millennium, Library of Congress, November 2000. Accessible at: <http://lcweb.loc.gov/catdir/bibcontrol/chan_paper.html>
Dunning, Ted (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics 19(1):61-74.
Larson, Ray (1992). Experiments in Automatic Library of Congress Classification. Journal of the American Society for Information Science 43(2):130-148.
Thompson, Roger; Shafer, Keith, and Vizine-Goetz, Diane (1997). Evaluating Dewey Concepts as a Knowledge Base for Automatic Subject Assignment. Paper presented at the first ACM Digital Libraries Workshop, January 1997. Accessible at: <http://orc.rsch.oclc.org:6109/eval_dc.html>
[1] Accessible at: <http://www.columbia.edu/cu/libraries/digital/>
[2] Accessible at: <http://www.public.iastate.edu/~CYBERSTACKS/>
[3] Accessible at: <http://vlib.org/>
[4] Accessible at: <http://orc.rsch.oclc.org:6109/>
[5] Accessible at: <http://staff.oclc.org/~godby/auto_class/auto.html>