Automatic classification research at OCLC

OCLC enlists the cooperation of the world's libraries to make the written record of humankind's cultural heritage more accessible through electronic media. Part of this goal can be accomplished through the application of the principles of knowledge organization. We believe that cultural artifacts are effectively lost unless they are indexed, cataloged and classified.

Accordingly, OCLC has developed products, sponsored research projects, and encouraged participation in international standards communities. The outcomes have been improved library classification schemes, cataloging productivity tools, and new proposals for the creation and maintenance of metadata. Though cataloging and classification require expert intellectual effort, we recognize that at least some of the work must be automated if we hope to keep pace with cultural change.

Our research explores the following questions:

  1. Can standard library classification schemes such as the Dewey Decimal Classification and the Library of Congress Classification be adapted to classify materials automatically, especially Web resources and other digitized electronic documents? Is there a role for indexes and topic maps that are obtained directly from source documents?
  2. What improvements to automatic classification systems can be made to get as close to human performance as possible? How useful is the result? Can the results be used in subject browsing and searching or the creation of minimal metadata records? Should an automatic classifier be included in toolkits for Webmasters or in other human-mediated processes?
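To make the first question concrete, here is a minimal sketch of one way a classification scheme could serve as a knowledge base: each class is described by a few caption terms, and a document is ranked against the classes by term overlap. This is an illustration only, not a description of OCLC's software. The class letters follow the Library of Congress Classification, but the caption terms, the `rank_classes` function, and the sample document are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical caption terms, written by hand for this sketch.
# They are NOT taken from the actual LCC schedules.
CLASS_CAPTIONS = {
    "Q (Science)": "science mathematics physics chemistry biology astronomy",
    "R (Medicine)": "medicine disease surgery nursing pharmacology health",
    "S (Agriculture)": "agriculture crops soil forestry livestock farming",
    "T (Technology)": "technology engineering manufacturing construction electronics machines",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def rank_classes(document, captions=CLASS_CAPTIONS):
    """Rank classes by how often their caption terms occur in the document."""
    doc_counts = Counter(tokenize(document))
    scores = {}
    for label, caption in captions.items():
        # Sum the document frequency of each distinct caption term.
        scores[label] = sum(doc_counts[term] for term in set(tokenize(caption)))
    # Highest score first; ties are broken alphabetically by label.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


ranked = rank_classes("Soil chemistry and crops: how soil affects farming yields.")
for label, score in ranked:
    print(label, score)
```

In this toy example the Agriculture class ranks first because "soil," "crops," and "farming" all appear in its caption. A real system would need far richer class descriptions and a better weighting scheme than raw term counts.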

In a recent article in Scientific American, Tim Berners-Lee argues that the current Web will be transformed into the more intelligent Semantic Web when it is augmented with data for automated processing. For the Semantic Web to work, we must have access to structured collections of information, as well as appropriately annotated Web pages. Since some of the goals of our work are consistent with this vision, we are assessing the utility of the Resource Description Framework (RDF), a building block of the Semantic Web, in making Web documents more accessible by subject.
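As a small illustration of what subject annotation in RDF can look like, the sketch below writes subject statements about a Web page as N-Triples, using the Dublin Core `subject` element as the property. It is an assumption-laden example rather than the encoding used in our project: the page URL, the subject strings, and the helper names are hypothetical, and the escaping covers only backslashes and double quotes.

```python
# Dublin Core "subject" element, used here as the RDF property.
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"


def escape_literal(value):
    """Escape backslashes and double quotes for an N-Triples literal.

    Simplification: control characters such as newlines are not handled.
    """
    return value.replace("\\", "\\\\").replace('"', '\\"')


def subject_triple(page_url, subject):
    """Return one N-Triples line stating that the page has the given subject."""
    return '<{}> <{}> "{}" .'.format(page_url, DC_SUBJECT, escape_literal(subject))


# Hypothetical page and subject terms, for illustration only.
triples = [
    subject_triple("http://example.org/soil-guide.html", term)
    for term in ("Soil chemistry", "Agriculture")
]
print("\n".join(triples))
```

Each line is a single statement of the form subject, property, value, which is the basic unit that RDF-aware software can collect and query by subject.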

To address our research questions, we draw on our background in information retrieval, Web metadata creation, and natural language processing, and on our intellectual involvement in two of the world's most widely used library classification schemes.

Project papers and presentations

Jean Godby and Devon Smith. "Strategies for Subject Navigation Using RDF Topic Maps." Presented at the Knowledge Technologies 2002 Conference, Seattle, Washington, March 2002.

Jean Godby and Jay Stuler. "The Library of Congress Classification as a knowledge base for automatic subject categorization." Presented at the IFLA Preconference "Subject Retrieval in a Networked Environment," Dublin, Ohio, August 2001.

Jean Godby. "Terminology identification from full text: OCLC's WordSmith experience." Presentation at the Southern Ohio Chapter of the American Society for Information Science & Technology (SO-ASIST) meeting "Aboutness: Automated Indexing & Categorization," Lexis-Nexis, Miamisburg, Ohio, June 21, 2001.

Jean Godby. "The automatic encoding of lexical knowledge in RDF topicmaps." Presentation at the Knowledge Technologies 2001 Conference, Austin, Texas. March 6, 2001.

Jean Godby and Ray Reighart, 2001. "Terminology identification in a collection of Web resources." The Journal of Internet Cataloging, Volume 4, Numbers 1/2, pp. 49-65. Also published in CORC: New Tools and Possibilities for Cooperative Electronic Resource Description, edited by Karen Calhoun and John J. Riemer. The Haworth Information Press, pp. 49-65.

Jean Godby and Diane Vizine-Goetz, 2000. "ISKO participants discuss ways librarianship can improve responsiveness of the Web." OCLC Newsletter, No. 247, pp. 20-21 (September/October).

Anders Ardö, Jean Godby, Andrew Houghton, Traugott Koch, Ray Reighart, Roger Thompson and Diane Vizine-Goetz. "Browsing Engineering Resources on the Web." In Beghtol, L., Howarth, L. and Williamson, N. (editors), Dynamism and Stability in Knowledge Organization: Proceedings of the Sixth ISKO Conference, 10-13 July 2000, pp. 385-390.

Jean Godby, Eric Miller and Ray Reighart. "Automatically Generated Topic Maps of World Wide Web Resources." Presentation at the Ninth International World Wide Web Conference, Developers' Day Session on the Semantic Web, May 15, 2000.

Current projects

The Library of Congress Classification as a knowledge base for automatic subject assignment. On the Scorpion demo page, choose the database entitled "Schedules QRST with filtered WorldCat and LCSA Headings." It contains adaptations of the schedules for Science, Medicine, Technology and Agriculture. For inquiries, contact Jean Godby (godby@oclc.org).

Experiments with the Semantic Web. A separate Web page describes our Open Source TopicMap demo. For inquiries, contact Devon Smith (smithde@oclc.org).

Links to related OCLC projects

The Forest Press Home Page

Knowledge Organization Research

The Library of Congress Schedule H as a hierarchical browse display

The Scorpion Project

The WordSmith Project

Links to related external sites

The Dublin Core Home Page

The Electronic Engineering Library, Sweden

The Library of Congress Home Page

The Resource Description Framework

The Semantic Web

Project Team

Carol Jean Godby (godby@oclc.org) is a Consulting Research Scientist in the Office of Research at OCLC and manager of the Automatic Classification project. She recently finished a Ph.D. in linguistics at The Ohio State University. Her dissertation, A Computational Study of Lexicalized Noun Phrases in English, is available through the OhioLink option of the Electronic Theses and Dissertations initiative.

Devon Smith (smithde@oclc.org) is a Systems Analyst in the Office of Research at OCLC. He is a recent computer science graduate of Case Western Reserve University, where he developed expertise in Perl and Java programming, Linux, and computer security.

Jay Stuler (stuler@oclc.org) is a Technical Intern in the Office of Research at OCLC. He is a Computer Science and Engineering undergraduate at The Ohio State University.

Updated March, 2002