Library Classification Schemes and Access to Electronic Collections: Enhancement of the Dewey Decimal Classification with Supplemental Vocabulary

 

Diane Vizine-Goetz and Jean Godby

Office of Research

OCLC Online Computer Library Center, Inc.

6565 Frantz Road

Dublin, OH 43017-3395

 

Copyright 1997 American Society for Information Science. Electronic version of a paper published in Advances in classification research Volume 7: proceedings of the 7th ASIS SIG/CR classification research workshop, 20 October 1996, Baltimore, Maryland, Paul Solomon, ed., Medford, N. J.: Information Today, Inc.

Abstract

A traditional library classification scheme, such as the Dewey Decimal Classification, requires ongoing improvements to the terminology of caption headings and to the currency, size and scope of its indexing vocabulary if it is to function effectively as a knowledge structuring tool for electronic collections. In this paper we describe two complementary research efforts to enhance the DDC with supplemental vocabulary. One project focuses on effecting links between the DDC and other subject access systems. The other project is concerned with associating current and end-user terminology from full text with the DDC. With these more versatile versions of the DDC, we are exploring the potential for automatically assigning classes to electronic documents.

1. Introduction

            The Dewey Decimal Classification (DDC) is the most widely used library classification system in the world, providing comprehensive subject arrangement for books and other materials, including electronic resources [1]. The twenty-first edition of the DDC, published in both print and electronic versions, exhibits several major enhancements, including expansions of the Dewey knowledge base, changes in structure, and revision and updating of terminology (Mitchell, 1997). Expansions involve the inclusion of new topics in the classification and extensions to existing classes. Structural changes include improvement of captions and expansion of the Relative Index. Terminology has been updated throughout the DDC to reflect currency, international usage, and sensitivity to the preferred usage of social and national groups. According to Mitchell, these changes make the DDC easier to apply, reflect modern classification design principles, and support current and future uses of the underlying database.

            The goal of our research is to create an even more versatile version of the DDC--one that is capable of organizing large collections of electronic documents, especially on the Internet and WWW. To accomplish this, we have aligned our research efforts with those DDC 21 enhancements that adapt the DDC for use in electronic resource description and discovery:

            These three areas are critical for ongoing improvement if traditional classification schemes are to function effectively as knowledge structuring tools for electronic collections (Cochrane, 1996; Forest Press, 1995; Koch et al., 1997).

            In this paper we describe two complementary research efforts to enhance the DDC with supplemental vocabulary. One project focuses on effecting links between the DDC and other subject access systems; the other is concerned with associating current and end-user terminology from full text with the DDC.

2. Creating a More Versatile DDC

2.1 Linking DDC to the Library of Congress Subject Headings

            Starting with the Library of Congress Subject Headings (LCSH), researchers at OCLC are exploring both operational and experimental approaches to linking other subject access systems with the DDC. These associations take the form of: 1) DDC/LC subject heading links that have been reviewed editorially, 2) LC subject headings that are statistically related to DDC class numbers, and 3) links between Dewey indexing terminology and established headings or cross references in LC subject authority records. Techniques for handling emerging topics and popular terminology not represented by entries in a controlled vocabulary system are discussed in section 2.2.

2.1.1 DDC/LCSH (Editorial Links)

            The first method for linking DDC to LCSH is an ongoing service of OCLC Forest Press [2]. The Dewey editorial staff review newly approved LC subject headings and pair them with candidate DDC numbers. These new headings provide classifier assistance for topics of current interest not mentioned explicitly in the classification schedules. Examples include:

Alien abduction                         001.94
Amber fossils                           562
Computer sex                            025.063067, 306.70285
Open-book management                    658.3152
Irradiated vegetables                   664.88
Militia movements                       322.42
Perl (Computer program language)        005.133
Teleportation                           133.88

Linking the DDC to new topics (represented within LCSH) extends the Dewey knowledge base without demanding changes to Dewey's structure and defined relationships. As the DDC is revised and sufficient literary warrant is demonstrated, some of these topics may be assimilated into the classification as individual class numbers, while others may be added to existing classes as Relative Index terms. LC subject headings not integrated into the DDC can continue to serve as supplemental indexing vocabulary. Editorially associated LC subject headings are shown in Dewey context for DDC class 004.678 Internet (Figure 1). Without the associated LC headings, this class number is connected with little subject-rich information, namely, one Relative Index term and one note with marginal subject-oriented terminology.

 

Figure 1. DDC enhanced with LCSH

Class Number: 004.678

Caption: Internet

Notes:

Class a specific regional or national network with the area served, e.g., Janet 004.678094
See Manual at 004.678 vs. 025.04, 384.33
Use notation T1--019 from Table 1 as modified at 004.019

DDC Index Terms:

Internet

Editorially associated LCSH:

Internet consultants
HTTP (Computer network protocol)
Internet service providers

Statistically associated LCSH:

Internet (Computer network)
World Wide Web (Information retrieval system)
Internet (Computer network) in education

 

2.1.2 DDC/LCSH (Statistical Links)

             The second kind of association between Dewey class numbers and LC subject headings is based on statistical associations between the classification number and the first occurring subject heading in bibliographic records. Although there is not always a one-to-one correspondence between these two elements, the classification number and the first subject heading generally represent the predominant topic of a work (Library of Congress, 1996, H80). This type of link is illustrated by the statistically associated LCSH shown in Figure 1.

            In this case, the LC subject heading Internet (Computer network) appears as the first subject heading in 66% (26 of 39) of the LC-contributed records in WorldCat (the OCLC Online Union Catalog) containing class number 004.678. Statistically associated LCSH are listed in order of decreasing frequency of occurrence with the given DDC class number. For example, World Wide Web (Information retrieval system) appears in 10% (4 of 39) of records with class number 004.678, and Internet (Computer network) in education appears in 5% (2 of 39) of the records. These statistics were computed for main headings plus topical subdivisions only. Form subdivisions, such as Handbooks, manuals, etc. and Juvenile literature, were dropped from headings, since including them can disperse headings into separate, roughly equivalent groups. This approach generally produces satisfactory results, with an increase of six indexing terms in the previous example. Mismatches, however, do sometimes occur for Dewey class numbers that include aspects of topics brought out by subdivisions that are dropped from headings during the statistical mapping process.
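The counting procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure actually run against WorldCat: the record layout, the function names, the toy records, and the two-item list of form subdivisions (the two named in the text) are all hypothetical.

```python
from collections import Counter, defaultdict

# Form subdivisions to drop. Only the two named in the text are listed;
# a real list would be longer (assumption).
FORM_SUBDIVISIONS = {"Handbooks, manuals, etc.", "Juvenile literature"}


def normalize_heading(heading):
    """Keep the main heading plus topical subdivisions; drop form subdivisions."""
    parts = [part.strip() for part in heading.split("--")]
    return "--".join(part for part in parts if part not in FORM_SUBDIVISIONS)


def statistical_links(records):
    """Associate each class number with its first subject headings.

    `records` is an iterable of (class_number, [headings in record order]).
    Returns {class_number: [(heading, count, share_of_records), ...]},
    with headings in order of decreasing frequency.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for class_number, headings in records:
        totals[class_number] += 1
        if headings:
            # Only the first-occurring subject heading is counted.
            counts[class_number][normalize_heading(headings[0])] += 1
    return {
        class_number: [
            (heading, n, n / totals[class_number])
            for heading, n in counter.most_common()
        ]
        for class_number, counter in counts.items()
    }


# Toy records (hypothetical), not actual WorldCat data.
sample_records = [
    ("004.678", ["Internet (Computer network)--Handbooks, manuals, etc."]),
    ("004.678", ["Internet (Computer network)",
                 "World Wide Web (Information retrieval system)"]),
    ("004.678", ["World Wide Web (Information retrieval system)"]),
    ("004.678", ["Internet (Computer network)--Juvenile literature"]),
]
sample_links = statistical_links(sample_records)
```

Because form subdivisions are stripped before counting, the three Internet (Computer network) records in the toy data fall into a single group rather than three separate ones, which is the dispersion the text describes avoiding.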

            In the previous example we illustrated how linking LC subject headings to the DDC, either editorially or statistically, can significantly improve a classifier's ability to locate topics of interest in the Dewey knowledge structure. This is especially true when the headings are integrated into electronic versions of the classification [3]. To facilitate enhancement of the DDC with supplemental terminology, the Dewey editorial staff have begun tagging selected index terms to indicate their disposition in the print or electronic index to the classification (Mitchell, 1997, p. 13). The electronic Relative Index includes the same entries as the print version plus additional terms supplied by the editors. Making such distinctions allows for the addition of a wide range of indexing vocabulary that can be used for various purposes, including classifier assistance, end-user access, and automatic subject categorization. See Vizine-Goetz (1997a) for a description of how revamped Dewey captions are being used as a retrieval device for a database of Internet resources.

            In the next section, we describe a research effort that links Dewey indexing terminology with LC subject headings and references for the purpose of automatically assigning subjects to electronic documents.

 

2.1.3 DDC/LCSH (Heading & Cross Reference Links)

            OCLC is engaged in a set of research projects that are exploring the potential for automatically assigning subjects to electronic documents [4]. In the Scorpion project, a series of ranked retrieval databases has been built from the machine-readable version of DDC 21 to address the challenge of cost-effectively applying classification schemes to electronic information (Shafer, 1996). The Scorpion system can be accessed by means of a Web interface that retrieves an electronic document and generates a query from its content. The system generates ranked lists of Dewey numbers that function as possible subjects for documents. For example, the Scorpion system automatically assigned the class numbers shown in Figure 2 to the SIG/CR home page [5]. Although some of the classes are off the mark, many are directly relevant to the contents of the page:

  • 025.3 Bibliographic analysis and control
  • 001.4092 Researchers
  • 658.571 Fundamental research
  • 020 Library and information sciences
  • 025.4 Subject analysis and control
  • 020.7 Education, research, related topics

 A variety of techniques are currently being investigated to improve the accuracy of such Scorpion classifications.

 

Figure 2. Dewey classes assigned to the SIG/CR home page by the Scorpion system
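The general idea of treating a document as a query against a database of class records can be sketched as follows. This is a deliberately simple word-overlap ranking written for illustration; it is not Scorpion's actual retrieval algorithm, which is not specified here. The function names, the stop-word list, and the toy class records are all assumptions.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (assumption).
STOP_WORDS = {"and", "of", "the", "in", "for", "a"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words, minus stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


def rank_classes(document_text, class_records, top_n=5):
    """Rank class numbers by how often the document uses words from each record.

    `class_records` maps a class number to the text of its record
    (for example, caption plus index terms).
    """
    doc_counts = Counter(tokenize(document_text))
    scored = []
    for class_number, record_text in class_records.items():
        record_terms = set(tokenize(record_text))
        score = sum(doc_counts[term] for term in record_terms)
        if score > 0:
            scored.append((score, class_number))
    # Highest score first; ties broken by class number for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [class_number for _, class_number in scored[:top_n]]


# Toy class records and page text (hypothetical).
toy_records = {
    "025.3": "Bibliographic analysis and control cataloging indexing",
    "025.4": "Subject analysis and control",
    "381.1": "Retail trade",
}
toy_page = "Classification research covers subject analysis, cataloging, and indexing."
toy_ranking = rank_classes(toy_page, toy_records)
```

In this toy setting, any vocabulary added to a class record directly increases the chance that the record matches a document, which is why enriching records with supplemental terminology can change which classes are retrieved.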

 

 

            With the help of the Scorpion system, we are investigating the retrieval consequences of enhancing the DDC with term variants from LCSH (MARC tags 4xx). These include synonyms, variant spellings, variant forms of expression, and earlier forms of headings. Augmenting the DDC in this manner involves matching Dewey Relative Index terms to authorized LC headings in the OCLC Authority File and then adding the LCSH variants to the corresponding Dewey record in experimental versions of the Scorpion Dewey databases. Examples of Dewey Relative Index terms that match LCSH are shown in Figure 3 (e.g., Cataloging in publication and Indexing). Each matches one authorized LC heading (MARC tag 150). The resulting Scorpion Dewey record is shown in Figure 4. Note that the field labeled Library of Congress Subject Heading(s) contains the added terminology from LCSH.
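The matching step can be sketched in Python using the data shown in Figure 3. This is a minimal sketch under stated assumptions: the dictionary layout, the field name "lcsh", and the function name are hypothetical, matching is done by exact case-insensitive comparison, and only the 450 variants are added. The actual Scorpion record format and matching rules may differ.

```python
def add_lcsh_variants(dewey_record, authority_records):
    """Return a copy of the Dewey record with LCSH variants (tag 450) added.

    An index term qualifies when it equals an authorized heading (tag 150),
    ignoring case. Each authority record is a dict with a "150" string and
    a "450" list of strings.
    """
    by_heading = {record["150"].lower(): record for record in authority_records}
    variants = []
    for term in dewey_record["index_terms"]:
        match = by_heading.get(term.lower())
        if match is not None:
            variants.extend(match["450"])
    enhanced = dict(dewey_record)
    enhanced["lcsh"] = variants
    return enhanced


# Data transcribed from Figure 3.
dewey_025_3 = {
    "class_number": "025.3",
    "caption": "Bibliographic analysis and control",
    "index_terms": [
        "Bibliographic analysis",
        "Bibliographic control",
        "Cataloging--library science",
        "Cataloging in publication",
        "CIP (Cataloging)",
        "Indexing",
        "Indexing--information science",
        "International Standard Book Number",
        "ISBN (Standard book number)",
    ],
}
authority_file = [
    {"150": "Cataloging in publication",
     "450": ["Cataloging in source", "CIP program"]},
    {"150": "Indexing",
     "450": ["Books--Indexing", "Index preparation",
             "Preparation of indexes", "Subject analysis"]},
]
enhanced_025_3 = add_lcsh_variants(dewey_025_3, authority_file)
```

With this data, two of the nine index terms match an authorized heading, so the record gains six variant terms, including Subject analysis and Index preparation.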

             Preliminary experiments indicate that adding term variants from LCSH improves the ability of the Scorpion system to automatically assign subjects (classes) to electronic documents. For instance, using Scorpion Dewey databases without LCSH enhancements, classes 025.3 Bibliographic analysis and control and 025.4 Subject analysis and control failed to be retrieved among the top 30 retrievals when automatically classifying the SIG/CR home page. These investigations have also been quite valuable in suggesting additional techniques for associating LC headings with the Dewey knowledge base. Vizine-Goetz (1997b) describes how Scorpion is being used in the ExTended Concept Trees project to position LC subject headings in the DDC knowledge structure. In the next section we will describe how natural language parsing tools are being used to enhance the DDC with supplemental vocabulary from free text.

2.2 Importing terminology from free text

            In the work described so far, the sources of supplemental DDC terms are reference works maintained by cataloging and classification experts. Another rich source of terms is unrestricted text, now easily accessible from the World Wide Web. This text will always be ahead of reference works because, as any lexicographer knows, language may change faster than humans can analyze it. The following terms were recently extracted from Web-accessible newspaper texts: alternative rock band, anonymous FTP site, artificial life, baby-boomer parents, chat rooms, frames-capable browser, high-performance computing, and virtual mall. None of these concepts are found in the current version of the DDC. We are investigating the hypothesis that these concepts can be automatically mined and classified.

A straightforward way to do this involves three steps. First, the new vocabulary can be extracted from free text using statistical techniques developed by computational linguists that identify subject-bearing terms and phrases in large corpora. For instance, the techniques used in this study are similar to those described by Zhou & Tanner (1997). Second, a context can be automatically built that contains the words that are highly associated with the target concept. Finally, this context can be used as a query to Scorpion. Godby (1997) provides a more detailed description.

            For example, the phrase virtual mall appears 28 times in a 5-million-word corpus of descriptions of Web pages that were created for OCLC’s NetFirst database. From descriptions such as:

....presents the Hall of malls, a collection of online malls and other sites that are commercially oriented, such as shopping centers and the World Wide Yellow Pages. Provides access to many malls, including eMall, eShop Plaza, Virtual Mall....

it is possible to construct a context with the following words that are highly associated with virtual mall:

business, commercial, marketplace, net, networking, shopping, web...
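The second step, building a context of associated words, can be illustrated with a simple co-occurrence measure. This sketch is an assumption made for illustration and is not the statistic used in the study: a word is scored by the share of the descriptions containing it that also contain the target phrase. The function name, the thresholds, and the toy descriptions are hypothetical.

```python
import re
from collections import Counter


def build_context(phrase, descriptions, top_n=5, min_count=2):
    """Return words that tend to occur in descriptions containing `phrase`.

    Score = (descriptions with both the word and the phrase) /
            (descriptions with the word).
    Words seen with the phrase fewer than `min_count` times are ignored.
    """
    phrase = phrase.lower()
    phrase_words = set(phrase.split())
    with_phrase = Counter()
    overall = Counter()
    for text in descriptions:
        lowered = text.lower()
        words = set(re.findall(r"[a-z]+", lowered))
        overall.update(words)
        if phrase in lowered:
            with_phrase.update(words - phrase_words)
    scored = [
        (with_phrase[word] / overall[word], with_phrase[word], word)
        for word in with_phrase
        if with_phrase[word] >= min_count
    ]
    # Strongest association first, then most frequent, then alphabetical.
    scored.sort(key=lambda item: (-item[0], -item[1], item[2]))
    return [word for _, _, word in scored[:top_n]]


# Toy descriptions (hypothetical), not NetFirst data.
toy_descriptions = [
    "Virtual mall with online commercial shopping for business customers",
    "A virtual mall offering shopping and commercial services",
    "Weather reports and forecasts for business travelers",
    "Online weather maps",
]
toy_context = build_context("virtual mall", toy_descriptions)
```

The resulting word list can then be joined into a single query string and submitted to a ranked retrieval system.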

 

Figure 3. Dewey Relative Index Terms and Matching LCSH

Dewey Classification Data

Class Number: 025.3

Caption: Bibliographic analysis and control

DDC Index Terms:

Bibliographic analysis
Bibliographic control
Cataloging--library science
Cataloging in publication
CIP (Cataloging)
Indexing
Indexing--information science
International Standard Book Number
ISBN (Standard book number)

OCLC Authority Record

[Fixed fields and indicators omitted]

010 sh 85020824
150 Cataloging in publication
450 Cataloging in source
450 CIP program

OCLC Authority Record

[Fixed fields and indicators omitted]

010 sh 85064867
150 Indexing
450 Books--Indexing
450 Index preparation
450 Preparation of indexes
450 Subject analysis

[MARC Tag 150 = Heading, Topical Term; MARC Tag 450 = See From Tracing, Topical Term]

 

Figure 4. Scorpion Dewey Record for DDC Class 025.3 Bibliographic Analysis and Control

When this context is used as a query to Scorpion, the top five categories listed below are retrieved, revealing the two obvious facets of a concept that refers to shopping on the Internet:

  • 004.6 Interfacing and communications
  • 004.678 Internet
  • 025.04 Information storage and retrieval
  • 380. Commerce (trade)
  • 381.1 Retail trade

             The major goal of this work is to provide additional indexing vocabulary for the DDC. This vocabulary can be used in several ways. The concepts extracted from free text and their candidate classifications can be reviewed by the DDC editorial staff for possible inclusion in future editions of the DDC. The new concepts also provide end-user vocabulary that enables electronic versions of the DDC to be customized to a particular database, enhancing the DDC’s utility as a browsing and searching tool.

             Finally, the concepts extracted from free text can supplement the effort to map the LCSH to the DDC. Some phrases extracted from free text, such as artificial life, are in the current edition of the LCSH, but there may be too few records in WorldCat (in this case, only four) to support the type of statistical link to the DDC described in Section 2.1. The method described here could be used to fill the gap. With a context built from the same 5-million-word corpus of descriptions of Web documents that was used to create the previous example, Scorpion gives the following top five categories for artificial life:

  • 006.3 Artificial intelligence
  • 153 Conscious mental processes and intelligence
  • 006.333 Deduction, problem-solving, reasoning
  • 006.37 Computer vision
  • 006.31 Machine learning

             These results are quite encouraging, given that the subject indexing in the WorldCat records containing the LC subject heading Artificial life includes DDC number 006.3 and the LC heading Artificial intelligence.

3. Summary

            In this paper we show how the indexing vocabulary of the DDC knowledge base can be enhanced through editorial and statistical mapping of LCSH. In many cases, mapped headings represent current and popular topics not represented by existing captions or Relative index terms, but that are already within the scope of Dewey's structure. We further show how terminology imported from free text can form a bridge between prevailing language usage and the more disciplined Dewey editorial process for admitting new topics and terminology to the Classification. Through these complementary efforts we are able to experiment with a more versatile version of the DDC that may prove to be suitable for automatically assigning classes to electronic documents.

4. References

Cochrane, P. A. (1996). New roles for classification in libraries and information networks. Cataloging & Classification Quarterly, 21, 3-4.

Forest Press. (1995). Dewey Decimal Classification research agenda. [Document posted on the World Wide Web]. Retrieved May 20, 1997 from the World Wide Web: http://www.oclc.org/oclc/fp/research/agenda.htm

Godby, C. J. (1996). Enhancing the indexing vocabulary of the Dewey Decimal Classification. Annual Review of OCLC Research 1996. [Article posted on the World Wide Web]. Retrieved May 20, 1997 from the World Wide Web: http://www.oclc.org/oclc/research/publications/review96/vocabulary.htm

Koch, T. (1997). The role of classification schemes in Internet resource description and discovery. [Development of a European Service for Information on Research and Education (DESIRE) project report posted on the World Wide Web]. Retrieved May 29, 1997 from the World Wide Web: http://www.ukoln.ac.uk/metadata/DESIRE/classification/

Library of Congress. (1996, August). Subject cataloging manual: Subject headings. Washington, DC: Library of Congress.

Mitchell, J. S. (1997). DDC 21: an Introduction. In Dewey Decimal Classification: Edition 21 and International Perspectives. Albany, NY: Forest Press, 3-15.

Shafer, K. E. (1996). A brief introduction to Scorpion. [Document posted on the World Wide Web]. Retrieved May 20, 1997 from the World Wide Web: http://orc.rsch.oclc.org:6109/bintro.html

Vizine-Goetz, D. (1997a). Classification research at OCLC. Annual Review of OCLC Research 1996. [Article posted on the World Wide Web]. Retrieved May 20, 1997 from the World Wide Web: http://www.oclc.org/oclc/research/publications/review96/class.htm

Vizine-Goetz, D. (1997b). OCLC Investigates using classification tools to organize Internet data. OCLC Newsletter, 226, 14-18. [Also as article posted on the World Wide Web]. Retrieved May 20, 1997 from the World Wide Web: http://www.oclc.org/oclc/new/n226/research.htm#investigates

Zhou, J. & Tanner, T. (1997). Construction and visualization of key term hierarchies. In Fifth conference on applied natural language processing: proceedings of the conference, 31 March - 3 April, 307-311. Somerset, NJ: Association for Computational Linguistics.

 

5. Notes

[1] The OCLC NetFirst database contains bibliographic citations describing electronic resources including World Wide Web pages, interest groups, library catalogs, FTP sites, Internet services, Gopher servers, electronic journals, and newsletters. Each item is cataloged with Dewey classification numbers. 

[2] This service is described in OCLC news release "LC Subject Headings and Dewey Numbers Linked on Web". Accessible at: http://www.oclc.org/oclc/press/961206.htm

[3] Dewey for Windows includes up to five Library of Congress (LC) Subject Headings that are frequently used with a given class number. The associations are based on a statistical analysis of LC-contributed records in the OCLC® Online Union Catalog (OLUC). The LC Subject Headings offer additional lead-in terminology and help in assigning LC Subject Headings.

[4] Current projects in the OCLC Office of Research are reviewed in the Jan./Feb. 1997 issue of the OCLC Newsletter, No. 225. Classification-oriented projects include: Automatic Cutter Number Assignment, the Dewey 2000 project, the Scorpion project, and the WordSmith project. This issue is accessible at: http://www.oclc.org/oclc/new/n225/fs.htm#cur

[5] Special Interest Group / Classification Research Home Page. Last updated April 26, 1997. Accessible at: http://newarkwww.rutgers.edu/asis.sigcr/index.shtml