OCLC Internet Cataloging Project Colloquium
Position Paper

Using Library Classification Schemes for Internet Resources

by Diane Vizine-Goetz
OCLC Office of Research


Classification experts and librarians have long recognized the potential of library classification schemes for improving subject access to information. In a 1983 article, Svenonius describes several uses for classification in online retrieval systems, including the following: (1) to improve precision or recall, (2) to provide context for search terms, (3) to enable browsing, and (4) to serve as a mechanism for switching between languages. In the Dewey Decimal Classification (DDC) Online Project (Markey and Demeyer 1986), Markey demonstrated the first implementation of a library classification scheme for end-user subject access, browsing, and display. Although many online catalogs provide call number browsing, few employ classification in the manner described by Svenonius or explored by Markey, whose experimental online catalog enabled users to search and browse online DDC data. Only recently, some ten years after Markey's pioneering research, is online classification data once again being seriously viewed as a tool for providing advanced browsing and retrieval capabilities in online systems.

Online Classification Data

One factor that has contributed to the slow adoption of classification as a retrieval tool is that the DDC and the Library of Congress Classification (LCC) have only recently been converted into machine-readable form. The computerization of the DDC began with the production of DDC 19 (1979) from computer-based photocomposition tapes. This development and the Markey study prompted Forest Press, in 1984, to commission Inforonics to develop an online editorial support system (ESS) for the Dewey Classification. See Finni and Paulson (1987) for a description of the development of the Dewey ESS. The resulting system and database were used to produce DDC 20 (1989), the first classification to be produced using an online editorial support system.

A different path is being taken in the conversion of the LCC to machine-readable form. Recognizing the benefits of online classification data for maintenance and distribution of LCC, the Library of Congress began developing the USMARC Format for Classification Data in 1987. The format was given provisional approval in 1990, and shortly afterward the Library of Congress began converting the forty-six LCC schedules. The LCC database is expected to contain more than 450,000 classification records when complete. See Guenther (1992) for a summary of the development and implementation of the USMARC classification format.

Classification and the Internet

Electronic versions of the DDC and LCC make it possible to realize the potential of library classification to improve subject retrieval; however, much of the renewed interest in classification as an organizing and retrieval device for information resources has been sparked by the growth in usage of the Internet and World Wide Web (WWW).

Several WWW sites let users retrieve items of interest through word or phrase searches. Two popular sites, Yahoo and Infoseek, also allow users to navigate through a series of subject categories to discover potentially relevant documents. Although Yahoo and Infoseek use essentially the same input (WWW documents and Internet newsgroup files) as the basis for their subject structures, the resulting categories displayed to users are quite different. The broad subject "Education" is found at the top level of both Yahoo and Infoseek; however, the next level under "Education" reveals a very different organization of the topic in each system. In Yahoo (see appendix A) [http://www.yahoo.com/], over 30 subcategories are available for browsing education-related topics, while Infoseek (see appendix B) [http://guide.infoseek.com/] presents a leaner outline.

Library classification schemes have long provided a similar organizing tool for library materials. The subject categories found in the DDC and LCC are based largely on the topics expressed in monographic material in traditional book format. For printed books, the Dewey Summaries [http://www.oclc.org/fp/] and LC Classification Outline are the library community's functional equivalent to the subject categories of Yahoo and Infoseek. In fact, several noncommercial WWW sites are using DDC and LCC to provide subject access to Web-accessible documents. Some examples are:

DDC

The UK Web Library - Searchable Classified Catalogue of UK Web sites
[http://www.scit.wlv.ac.uk/wwlib/newclass.html]

CyberDewey: A guide to Internet resources organized using Dewey Decimal Classification codes
[http://ivory.lm.com/~mundie/CyberDewey/CyberDewey.html]

Morton Grove Public Library Webrary
[http://www.webrary.org/ref/weblinksmenu.html]

LCC

CyberStacks(sm) [http://www.public.iastate.edu/~CYBERSTACKS/homepage.html]

At a time when both Internet-based classification schemes and traditional library classification systems are being used to provide access to Internet resources, it is appropriate to review the major characteristics of DDC and LCC and to assess whether the electronic versions of these schemes can be successfully extended to the Internet.

Major Characteristics of DDC and LCC

DDC and LCC Are General Classification Systems

Chan, Comaromi, and Satija remind us that the purpose of the Dewey Decimal Classification is to arrange a general collection of materials--"[the DDC] aims to classify books and other material on all subjects in all languages in every kind of library ... ." Similarly, LCC is designed to provide order for a general collection, the collection of the Library of Congress. Although based on the collection of a single library, the LC Classification has been successfully adopted by a majority of U.S. academic and research libraries.

To determine how well library classification systems compare to Internet classifications in terms of general topic coverage, categories 1-10 and 35-45 of Yahoo's 50 most popular categories were compared to DDC and LCC. The results are shown in table 1. All but four Yahoo categories (7, 36, 41, and 45) mapped to explicit DDC or LCC numbers or ranges. Although DDC and LCC both contain provisions for subdivision by geographical area within topics and a geographical breakdown for historical works, no direct mapping could be made for categories 36 and 45, which are essentially geographic areas subdivided by topic. For category 7 (Magazines), all three schemes provide a topical breakdown. Category 41 (Humor, Jokes, and Fun) is the most dispersed when translated to DDC and LCC. In Dewey, humorous material can be classed by the specific literature or literary form, with specific subjects, etc. A similar situation exists in LCC. The mappings of the other categories indicate that DDC and LCC have sufficiently wide topic coverage for classifying Internet resources. This result is not surprising given that DDC and LCC numbers have been successfully assigned to more than 1.5 million items by the Library of Congress alone, resulting in more than 340,000 unique LCC classes and 280,000 unique DDC classes.

Table 1

Yahoo* | DDC | LCC
1. Entertainment | Performing arts (791-792) and by subject | Performing arts (PN) and by subject
2. Computers and Internet | Computers; Internet (004-006) | Computer Science; (QA76+) & Telecommunication (TK 5105)
3. News | News media; Broadcast media (070.1+; 302.23+) | Newspapers (AN), Journalism & Broadcast news (PN4699-5648)
4. Recreation | Recreation (793-799) | Recreation. Leisure (GV)
5. Business and Economy | Economics (330-390) | Economics (H-HJ)
6. Society and Culture | Religion (200), Social groups (305) & Culture and institutions (306) | Religion (BL-BX) Sociology (HM), The family. Marriage. Women (HQ), Social and Public welfare (HV)
7. Entertainment: Magazines | General periodicals (050) and by subject | General periodicals (AP) and by subject
8. Entertainment: Movies and Films | Motion pictures (791.43) | Motion pictures (PN1995.5)
9. Education | Education (370) | Education (L)
10. Arts | The Arts (700-799) | Fine Arts (N) and by topic
35. News: International | International news (070.4332) | Newspapers (AN) and by place, event
36. Regional: Countries | No direct mapping; geographical; treatment by subject or historical treatment by geographical area | No direct mapping; geographical; treatment by subject or historical treatment by geographical area
37. Arts: Photography | Photography (770) | Photography (TR1-1050)
38. Computers and Internet: Multimedia | Multimedia systems (006.6) | Computer Science (QA76+) and by subject
39. Entertainment: People | Performers (Entertainers) (791.092) | Fine Arts: Performing arts (NX1-820)
40. Society and Culture: Relationships: Dating | Social Sciences: Customs: Life cycle: Dating (306.7+; 392.6; 646.7+) | Social Sciences: The Family. Marriage. Woman: Dating (HQ801-801.83)
41. Entertainment: Humor, Jokes, and Fun | No direct mapping; by literary form, subject, etc. | No direct mapping; by literary form, subject, etc.
42. Business and Economy: Markets and Investments | Finance and investments (332.6) | Social Sciences: Finance (HG)
43. Social Science | Social Sciences (300-399) & History (900-999) | Social Sciences (H-HX) & History (D-DL, DS, DT, E-F)
44. Entertainment: Television: Shows | Television (791.45) | Drama: Television: Broadcasts (PN1992.8)
45. Regional: U.S. States | No direct mapping; geographical; treatment by subject or historical treatment by geographical area | No direct mapping; geographical; treatment by subject or historical treatment by geographical area

* On February 9, 1996, sites 1-45 of the top 50 sites were listed at http://www.yahoo.com/text/popular.html

A more detailed comparison of Yahoo and DDC was performed to further examine the suitability of library classification schemes for providing access to Internet resources. The Education high-level outline on Yahoo [figure 1] was compared with portions of the Dewey Edition 21 Education outline [figure 2] to determine how the two systems differ in scope and coverage. DDC caption headings in figure 2 have been edited for brevity. Of the 39 subcategories under education on Yahoo, 27 mapped to one or more classes in the DDC education schedules. Category 21, "K-12," was the most dispersed, mapping to 4 different DDC caption headings. Of the 27 topic areas, most mapped to DDC classes 1 to 3 levels deep, and only 4 (those marked with an asterisk) were 5 levels down in the DDC hierarchy. The categories "Conferences," "Companies," "Databases," "Journals," "Magazines," "News," and "Products" are represented by standard subdivisions in Dewey and are not shown in figure 2, but could be listed under the general caption heading for education in Dewey or under the specific aspects of education covered by the item. The categories "Courses" and "Programs," which can map to many places in the DDC education schedule (e.g., school lunch programs, multi-cultural education programs, work-study programs, etc.), were also omitted from the figure 2 display but counted as matching categories. Only three of the Yahoo categories ("Lectures," "Libraries," and "Interest groups") mapped to DDC classes outside the DDC education schedule. This analysis indicates that DDC possesses sufficient depth of coverage in its schedules and tables to be considered a viable tool for accessing Internet resources.

Figure 1. Yahoo Education High-Level Outline



Figure 2. Dewey Edition 21 Education Outline



DDC and LCC Have a Hierarchical Structure

Williamson points out that:

Hierarchical relationships are the essence of all classification. Enumerative classifications systems provide a systematic arrangement of subjects according to set of principles based on an accepted philosophy of the organization of knowledge, on patterns established on the basis of literary warrant, and frequently, on a combination of both. However, classified order is not self-evident. Some method or device is required to preserve the relationships among classes, subclasses, topics and subtopics. In some classification systems, for example DDC, these relationships are preserved and may be manipulated through the hierarchical notation. LCC does not fit this pattern. Its notation preserves order but does not reflect hierarchy. ... some other means must be found to preserve those relationships.

In DDC, the sequence of subjects from general to specific is indicated by the number of digits that form the DDC number. For example, when the DDC number 663.223 for the topic "making of red wine" is shown in the context of its Dewey hierarchy, it can be seen that "White wine," "Red wine," and "Sparkling wine" are at the same hierarchical level. The DDC number 663.22, corresponding to the heading "Specific kinds of grape wine," is one digit shorter than the numbers used to indicate specific kinds of wine and is therefore broader than, or superordinate to, those longer numbers. Indentation is also used to indicate hierarchy. Through both notation and indentation, this example shows that each topic except for the main class 600 Technology is subordinate to and part of all the broader classes above it.

600 Technology (Applied sciences)
660   Chemical engineering and related technologies
663     Beverage technology
663.2       Wine and wine making
663.22         Specific kinds of grape wine
663.222           White wine
663.223           Red wine
663.224           Sparkling wine
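The way DDC notation carries hierarchy can be sketched in a few lines of Python. This is only an illustration, assuming the notation is strictly hierarchical as in the wine example (one significant digit per level, with trailing zeros in the three-digit base treated as padding); the function names are hypothetical and not part of any Dewey software.

```python
from typing import List, Optional


def significant_digits(ddc: str) -> str:
    """Strip the decimal point and the padding zeros of the 3-digit base."""
    digits = ddc.replace(".", "")
    if len(digits) <= 3:
        # 600 -> "6", 660 -> "66"; keep at least one digit for "000"
        digits = digits.rstrip("0") or "0"
    return digits


def format_ddc(digits: str) -> str:
    """Re-pad to three digits and restore the decimal point."""
    padded = digits.ljust(3, "0")
    return padded if len(padded) == 3 else f"{padded[:3]}.{padded[3:]}"


def broader(ddc: str) -> Optional[str]:
    """Return the next broader class, or None for a main class."""
    digits = significant_digits(ddc)
    if len(digits) == 1:
        return None
    return format_ddc(digits[:-1])


def ancestors(ddc: str) -> List[str]:
    """List all broader classes, nearest first."""
    chain = []
    parent = broader(ddc)
    while parent is not None:
        chain.append(parent)
        parent = broader(parent)
    return chain


def are_coordinate(a: str, b: str) -> bool:
    """Two distinct classes are coordinate if they share the same broader class."""
    return a != b and broader(a) == broader(b)


print(ancestors("663.223"))
print(are_coordinate("663.222", "663.224"))
```

Under these assumptions, dropping the last significant digit moves one step up the hierarchy, so broader, narrower, and coordinate relationships can be computed from the class number alone, without consulting the indented display.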

In both the LC Classification and in Yahoo's category trees, hierarchy is indicated by the indentation of category or class labels. To illustrate, consider the following class numbers and headings listed in the LC Classification QA schedule:

QA76.33 Computer Camps
QA76.38 Hybrid Computers
QA76.4 Analog Computers
QA76.5 Digital Computers

Given only the notation (class numbers) and captions (headings), it is unclear what relationship exists among the ordered classes. When these classes are placed in the context of the LCC hierarchy in a display similar to what is found in the printed schedules, the indentation clearly indicates that Hybrid, Analog, and Digital computers are at the same level of hierarchy and that QA76.33 is a subcategory under Study and Teaching and not a type of computer.

QA71-QA90 Instruments and machines
QA75-QA76.95     Calculating machines
QA76       Electronic computers.  Computer Science
QA76.27         Study and Teaching
QA76.33           Computer Camps
QA76.38         Hybrid Computers
QA76.4         Analog Computers
QA76.5         Digital Computers
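Because LCC notation does not encode hierarchy, a program has to recover parent-child links from the indentation itself. The sketch below is one possible approach, assuming each schedule line has already been reduced to a (class number, indentation level, caption) tuple and that levels never skip a step; the data are transcribed from the display above and the function name is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (class number, indentation level, caption), transcribed from the QA display
SCHEDULE: List[Tuple[str, int, str]] = [
    ("QA71-QA90", 0, "Instruments and machines"),
    ("QA75-QA76.95", 1, "Calculating machines"),
    ("QA76", 2, "Electronic computers. Computer Science"),
    ("QA76.27", 3, "Study and Teaching"),
    ("QA76.33", 4, "Computer Camps"),
    ("QA76.38", 3, "Hybrid Computers"),
    ("QA76.4", 3, "Analog Computers"),
    ("QA76.5", 3, "Digital Computers"),
]


def parents_from_indentation(
    rows: List[Tuple[str, int, str]]
) -> Dict[str, Optional[str]]:
    """Map each class number to its parent, using indentation levels only."""
    parents: Dict[str, Optional[str]] = {}
    stack: List[str] = []  # stack[i] holds the most recent class at level i
    for number, level, _caption in rows:
        del stack[level:]  # discard classes at this level or deeper
        parents[number] = stack[-1] if stack else None
        stack.append(number)
    return parents


parents = parents_from_indentation(SCHEDULE)
for number, parent in parents.items():
    print(number, "<-", parent)
```

Run on the QA excerpt, this links QA76.33 to QA76.27 (Study and Teaching) and links the three kinds of computer to QA76, which is the relationship the notation alone leaves unclear.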

Yahoo subcategory trees also use indentation to indicate hierarchy. For example, the following hierarchy is found under "Computers and Internet":

Computers and Internet
  Internet
    Entertainment
      Interesting Devices Connected to the Net
        Spy Cameras
          Indoor Cameras
          Outdoor Cameras
          Pets@
            Aquariums

The preceding examples demonstrate that both Internet classification schemes and library classification schemes provide hierarchical structures capable of supporting topic browsing. Library schemes would seem to have some advantage over Internet-based schemes because they are accompanied by notations that facilitate the manipulation of class relationships. Recall that DDC's notation can be used to navigate broader, narrower, and coordinate relationships among classes, while LCC's can be used to arrange related topics in order. Yahoo's hierarchy structure requires encoding to take advantage of relationships among classes.

Library classification schemes are generally considered to be retrospective: classes are added or revised only after sufficient literary warrant is demonstrated and classes are removed with even greater caution. For these reasons much greater attention needs to be given to employing the implicit and explicit links between library classification systems and other subject oriented schemes. For example, Electronic Dewey, the electronic version of DDC20, includes a statistical mapping from the OCLC Online Union Catalog of up to five of the most frequently used LCSH to each Dewey number. This Electronic Dewey feature, which has been well received by users, provides additional indexing terms to lead users to appropriate topic areas in Dewey. In addition to statistical mappings, the Electronic DDC21 database will include many DDC/LCSH links that have been reviewed editorially. Links similar to those made in Electronic Dewey can be made for LCC and LCSH by processing the bibliographic records containing fields for both. For LCSH and LCC explicit links are also available in LC Subject Authority records that contain LC classification number fields. In an analysis of the LC Subject Authority file, Vizine-Goetz and Markey found that about 43% of topical subject heading records (MARC tag 150) contain LC classification number fields. Science and technology classes account for almost half (47.72%) of the LC class numbers.

In addition to providing supplemental vocabulary for topics already represented in class schedules, linking DDC with other subject thesauri provides a mechanism for allowing new topics to be represented in the classification even if each is not supplied with its own number. For example, the LC subject heading Microsoft Network (Online service), listed among the "Subject Headings of Current Interest" in CSB, No. 70 (Fall 1995), can be linked to DDC number 025.04 "Automated information storage and retrieval systems" and to DDC number 004.678 "Internet." This subject heading has been assigned to only four LC MARC records with DDC number 025.04 and therefore may not be among the top 5 LCSH statistically mapped to this number. The ability to map current terminology into DDC and LCC is particularly important if library classification schemes are to be used to provide access to Internet resources.

Links to Editions in Other Languages

Classification experts have long recognized the potential of DDC to serve as a mechanism for switching between languages. With the recent publication of DDC in French and Spanish, and with a Russian translation scheduled for publication in December 1997, it may now be possible to realize this capability.

Table 2 shows DDC captions in English, French, and Spanish for three DDC classes on the topic microcomputers. Captions and relative index terms in translation databases could be used to provide a multilingual subject browser to a database of Internet-accessible resources that have been assigned DDC numbers, such as OCLC's NetFirst database.

Table 2. DDC Captions

Class Number | English | Spanish | French
004.1 | General works on specific types of computers | Obras generales sobre tipos específicos de computadores | Ouvrages généraux sur les différents types d'ordinateurs
004.16 | Digital microcomputers | Microcomputadores digitales | Micro-ordinateurs
004.165 | Specific digital microcomputers | Microcomputadores digitales específicos | Micro-ordinateurs particuliers
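Since the class number is the same in every edition, it can act as the pivot between languages. The short Python sketch below illustrates the idea using only the three captions from Table 2; the dictionary layout, language codes, and function name are assumptions made for the example and do not describe the structure of any actual translation database.

```python
from typing import Dict, List, Tuple

# Captions from Table 2, keyed by DDC number and then by language code
CAPTIONS: Dict[str, Dict[str, str]] = {
    "004.1": {
        "en": "General works on specific types of computers",
        "es": "Obras generales sobre tipos específicos de computadores",
        "fr": "Ouvrages généraux sur les différents types d'ordinateurs",
    },
    "004.16": {
        "en": "Digital microcomputers",
        "es": "Microcomputadores digitales",
        "fr": "Micro-ordinateurs",
    },
    "004.165": {
        "en": "Specific digital microcomputers",
        "es": "Microcomputadores digitales específicos",
        "fr": "Micro-ordinateurs particuliers",
    },
}


def switch_language(term: str, source: str, target: str) -> List[Tuple[str, str]]:
    """Match a term against captions in one language; answer in another."""
    needle = term.lower()
    return [
        (number, captions[target])
        for number, captions in sorted(CAPTIONS.items())
        if needle in captions[source].lower()
    ]


# A Spanish search term leads to class numbers with English captions
for number, caption in switch_language("microcomputadores", "es", "en"):
    print(number, caption)
```

Resources that carry DDC numbers could then be retrieved by class number regardless of the language in which the user searched.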

 

Library Classification or Internet-based Schemes?

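This paper examines several characteristics of the DDC and LCC classification schemes that make them suitable for providing subject access to Internet resources. To review, DDC and LCC are general classification systems, have hierarchical structures, and are linked to other subject vocabularies; DDC is also available in editions in other languages.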

Despite these favorable properties, additional improvements are needed if online classification data is to be used as a major tool for providing online subject access to traditional collections as well as to Internet-accessible resources. The following improvements are recommended:

  1. Evaluate DDC and LCC captions for expressiveness and currency
  2. Decompose and code class number components to identify the specific subject and aspects represented
  3. Continue to add new terminology as index terms even if each is not supplied with its own number
  4. Expand links to other controlled vocabularies
  5. Expand definitions of literary warrant to include Internet resources
  6. Build demonstration systems

Recommendations 1 and 2 are not new. Over ten years ago, Karen Markey advocated that similar improvements be made to the DDC to facilitate its use in online catalogs. Arnold Wajenberg (1983) proposed a scheme for encoding DDC numbers to enhance automated subject retrieval. Fortunately, there now appear to be both the interest and the resources to make the needed enhancements to online classification data. For example, a project has been established to transform captions in the 1000 DDC summaries (the first three digits in Dewey) into end-user language [DDC ALA Midwinter Conference Report, January 1996; http://www.oclc.org/oclc/fp/news/9602ala.htm]. The recast summaries will be used in the prototype of a Dewey-based subject browser for the NetFirst database of Internet-accessible resources. While this is a good first step, it will be necessary to look well beyond the first three levels of Dewey to captions at lower levels of the DDC hierarchy, since many of the DDC numbers assigned to NetFirst records extend four or more digits past the decimal point. These efforts will advance recommendations 1 and 6.

In the context of DDC, work on recommendations 3 and 4 is also underway. Dewey editorial staff and OCLC research staff are collaborating on projects to enhance the electronic version of the Dewey editorial database with selected LC subject headings from the Weekly Lists [About the Subject Headings Weekly Lists; http://www.loc.gov/catdir/cpso/cpso.html] and headings with high postings in the OCLC Online Union Catalog database. Editorial staff will also add coding to the Editorial Support System database to indicate links between Dewey Relative Index terms and LCSH and Sears headings. For LCC, the LC Cataloging and Policy Support Office is reviewing the index structure of the LCC schedules and is consulting with classification expert Lois Chan on the design of a combined index to LCC. This work could well lead to future efforts to form better links between LCC and LCSH.

The projects described above indicate a commitment by the owners and maintainers of DDC and LCC to improve these systems for automated subject retrieval. If Internet resource catalogers display a similar commitment to assigning class numbers to the bibliographic records they create, online classification data can form an important bridge between library methods for organizing materials and Internet-based techniques for accessing electronic collections. Furthermore, DDC- and LCC-based interfaces will provide users with a common interface to traditional and electronic libraries.

References

Chan, Lois Mai, John P. Comaromi and Mohinder P. Satija. 1994. Dewey Decimal Classification: a practical guide. Albany, N.Y.: Forest Press. p. 6.

Finni, John J., and Peter J. Paulson. 1987. "The Dewey Decimal Classification enters the computer age: developing the DDC database(TM) and Editorial Support System." International Cataloguing 16(4): 46-48 (October/December 1987).

Guenther, Rebecca S. 1992. "The Development and Implementation of the USMARC Format for Classification Data." Information Technology and Libraries 11(2): 120-131 (June 1992).

Markey, Karen, and Anh N. Demeyer. 1986. Dewey Decimal Classification Online Project: Evaluation of a Library Schedule and Index Integrated into the Subject Searching Capabilities of an Online Catalog. Dublin, Ohio: OCLC Online Computer Library Center, Inc., Office of Research.

Svenonius, Elaine. 1983. "Use of classification in online retrieval." Library Resources and Technical Services 27(1): 76-80 (Jan./Mar. 1983).

Vizine-Goetz, Diane, and Karen Markey. 1989. "Characteristics of Subject Heading Records in the Machine-Readable Library of Congress Subject Headings." Information Technology and Libraries 8(2): 203-209 (June 1989).

Wajenberg, Arnold S. 1983. "MARC Coding of DDC for Subject Retrieval." Information Technology and Libraries 2(3): 246-251 (September 1983).

Williamson, Nancy J. 1995. The Library of Congress Classification: a content analysis of the schedules in preparation for their conversion into machine-readable form. Washington, D. C.: Library of Congress, Cataloging Distribution Service. p. 17.

Acknowledgments

The author is grateful for the valuable comments of Joan S. Mitchell, Editor, Dewey Decimal Classification, and for assistance from Barbara A. Brownell, Technical Processing Specialist, OCLC.

Appendix

A. Yahoo Education Topics

yahoo_education.jpg (72865 bytes)

B. Infoseek Education Topics

infoseek.gif (38706 bytes)

Last edited 11/23/1999

