Searching Digital Libraries

Ralph LeVan
OCLC Online Computer Library Center, Inc.

This paper was produced as a position paper for an
NSF-EU funded workshop on searching digital libraries.

 

Abstract

This paper presents the view that digital objects need to be searchable in much the same manner, from an end-user perspective, as text objects on the Internet are now, and that there exist reasonable means to implement such searching. A description of the problematic aspects of such searching is given, with one or more reasonable solutions supplied for each, though it is clear that more research needs to be done on finding optimal methods.

 

Introduction

Text is easy. Everything else is hard. The success of the World Wide Web (Web) is in no small part due to the fact that the Web designers chose to base the Web on text (HTML) documents. Z39.50, the ISO standard protocol for searching text, succeeded because it started with a simple subset of text (bibliographic records) and is only recently extending into more interesting text records. But, for all the attention these two protocols are getting, there is a lot of non-text (digital) data available on the Internet.

Searching local data stores is easy. Searching someone else's data store is hard. When building a local database, local tools are used and, hopefully, local tools exist to search the database as well. But, when searching someone else's database, local tools are usually incompatible, leaving the messy option of downloading their database and converting it to a local database which can then be searched with the local tools. If luck holds, data conversion will not be needed. This is a slow and painful way to search for things on the Internet.

Discovering text data on a local machine is easy. Discovering text data on the Internet is hard. Discovering digital data anywhere is really hard. It is easy to pop up the "Find" tool on a Windows machine and search for file names or for files containing specific text. There are numerous Internet Directories that support searching for HTML documents that the directory operators have stumbled across on the Internet that contain specific text. But, there is nowhere to search for files that contain references to a boiling point of 55 degrees Celsius, unless it is encoded in text.

 

What Are We Searching?

The object of this paper is to discuss how digital objects can be found and retrieved on the Internet. What are these digital objects? Sometimes they are called binary objects or binary files. Essentially, they are anything that is not text. They can be pictures or sound files or spreadsheets. Their distinguishing feature is that they do not contain text that can be used for searching. A site with more than one digital object is said to have a digital collection.

So, how do we search for these objects? There are two ways to search for digital objects: search for text descriptions of the files that, hopefully, contain pointers to the digital objects or search the digital collections directly. The first technique allows us to leverage that "text is easy" claim. The problems with searching these text surrogates for the real digital objects relate to the degree of difficulty in generating the surrogates and the degree to which searching these surrogates approximates the searching that might be done against the actual digital object.

Searching the digital objects directly suffers from a lack of ubiquitous tools. SQL and ODBC based tools come close to doing what is needed, but suffer from deficiencies when used to search other people's data and are only really good for relational data.

 

Searching Surrogates

The ability to search for digital objects using text descriptions depends on the quality and quantity of the description. The descriptive data for an object, whether it is a digital or text object, is commonly called metadata in the Internet world. For text objects, the metadata might include the author of the text, a free-text description of the text (an abstract) and maybe some subject headings chosen from a controlled vocabulary such as the Library of Congress Subject Headings (LCSH). Digital objects can be similarly described. But, where will this metadata come from?

 

Manually Created Metadata

Most of the text descriptions of digital objects are manually entered. Interdisciplinary groups, such as the Dublin Core Workshops, are trying to provide general guidelines for the kinds of metadata that should be provided with digital objects. Specific communities of digital object creators have their own guidelines for descriptive metadata. Other groups (e.g. W3C Resource Description Framework Working Group) are developing guidelines for transmitting the descriptions along with the digital objects. But, the end result depends on the expertise of the person doing the description and the amount of time that can be invested in generating the description.
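As a rough illustration of what searching such surrogates might look like, the sketch below stores a few descriptive fields for each digital object and matches query words against them, returning pointers to the objects. The field names follow Dublin Core element names; the Surrogate class, the search function and the example URLs used with it are assumptions made for illustration, not part of any standard or existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Surrogate:
    """A manually entered text description of one digital object."""
    identifier: str    # pointer (URL) to the digital object itself
    title: str
    creator: str
    description: str
    subject: list[str] = field(default_factory=list)
    format: str = ""   # e.g. a MIME type; not searched here

    def words(self) -> set[str]:
        # Collect every searchable word from the descriptive fields.
        text = " ".join([self.title, self.creator, self.description, *self.subject])
        return {w.strip(".,;:()").lower() for w in text.split()} - {""}


def search(surrogates, query):
    """Return pointers to objects whose description contains every query word."""
    terms = {t.lower() for t in query.split()}
    if not terms:
        return []
    return [s.identifier for s in surrogates if terms <= s.words()]
```

The search can only find what the describer chose to write down, which is the dependence on expertise and time noted above.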

A missing component is a tool for entering metadata and keeping it associated with the digital object. Microsoft Word provides a facility for entering descriptive information about the documents created with it. Author, title, subject and other data can all be specified. No such facility is available in the image creation tools. A ubiquitous tool, probably built into the operating system, would make it much easier to create such metadata for objects that are intended to be shared.

 

Automatic Description Tools

An alternative to manually created metadata is programmatically generated metadata. Such tools are being created today for text data. These tools analyze the contents of free-text records and generate controlled vocabulary descriptions of the records. Such tools should be possible for digital records.
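A very simplified sketch of this idea for free text is given below: words in a record are compared against trigger words for each controlled heading, and matching headings are assigned. The tiny vocabulary, the trigger words and the assign_headings function are all hypothetical; real tools of this kind use far richer analysis than word matching.

```python
import re

# Hypothetical controlled vocabulary: heading -> words that suggest it.
VOCABULARY = {
    "Chemistry": {"boiling", "solvent", "molecule"},
    "Photography": {"camera", "photograph", "exposure"},
}


def assign_headings(text, vocabulary=VOCABULARY, min_hits=1):
    """Return the headings whose trigger words occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        heading
        for heading, triggers in vocabulary.items()
        if len(words & triggers) >= min_hits
    )
```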

 

Searching Beyond Text

Why are there no tools for searching binary objects directly? This is partly due to the wide variety of formats for binary objects. The same problem existed for text objects, until HTML became the de facto text format standard. It is unlikely that a tool for searching for images in JPEG files will also be able to find values in spreadsheets.

A promising source of widely searchable binary objects is the large number of relational databases on the Internet. These databases could all be searchable through ODBC or some other relational query protocol. Such capability existed long before Z39.50, but never fulfilled the promise it offered. This was largely due to the inability of the protocols and database servers to treat the digital objects abstractly. An exact knowledge of the schema implemented in a database was required to search the database. This requires either that the searcher know the schemas of all the databases on the Internet or that all the databases be built with the same schema. Neither requirement is practical.

A subgroup of the Z39.50 Implementors Group (ZIG) is looking into applying the abstracting mechanisms of Z39.50 to SQL based searching. This may result in a tool that will allow searchers to ask questions of relational databases without knowing their structure.
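The kind of abstraction involved can be sketched as follows. Each database publishes a mapping from a small set of abstract access points (here "title" and "creator") to its own tables and columns, so the searcher names the access point and never sees the schema. This is a minimal sketch under stated assumptions, not the ZIG design: the mapping tables, the abstract_search function and the two demonstration schemas are invented for illustration, and SQLite stands in for any relational database.

```python
import sqlite3

# Hypothetical per-database mappings: abstract access point -> (table, column).
ACCESS_POINTS_A = {"title": ("images", "caption"), "creator": ("images", "photographer")}
ACCESS_POINTS_B = {"title": ("recordings", "name"), "creator": ("recordings", "performer")}


def abstract_search(conn, mapping, access_point, term):
    """Search one database by abstract access point instead of by column name."""
    table, column = mapping[access_point]
    # Table and column come from the site's own mapping, not from the searcher;
    # only the search term is user input, and it is passed as a parameter.
    sql = f"SELECT * FROM {table} WHERE {column} LIKE ?"
    return conn.execute(sql, (f"%{term}%",)).fetchall()


def build_demo_databases():
    """Create two in-memory databases with deliberately different schemas."""
    a = sqlite3.connect(":memory:")
    a.execute("CREATE TABLE images (caption TEXT, photographer TEXT, url TEXT)")
    a.execute("INSERT INTO images VALUES (?, ?, ?)",
              ("Moonrise over a village", "Adams", "http://example.org/a/1.jpg"))
    b = sqlite3.connect(":memory:")
    b.execute("CREATE TABLE recordings (name TEXT, performer TEXT, location TEXT)")
    b.execute("INSERT INTO recordings VALUES (?, ?, ?)",
              ("Evening concert", "Adams Quartet", "http://example.org/b/7.wav"))
    return a, b
```

The same request, a "creator" search for "adams", can then be sent unchanged to both databases even though one stores the name in a photographer column and the other in a performer column.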

 

How Do We Search The Internet?

Assuming that tools for searching for binary objects (or their surrogates) exist, how do we use them to search the Internet? Do we have to search every site on the Internet? How do we discover all those sites? What if a site doesn't want to support the millions of potential searches that might be run against it? There are several models of Internet searching that address these and other issues.

 

Centralized Searching

The common model for Internet searching is the central directory. These central directories attempt to collect all the web pages from all the web servers on the Internet. When a search is performed, it is performed against the directory's copy of the web pages instead of searching all the web servers directly.

This is a very efficient model. Web servers are visited once per page per directory service. These pages are re-retrieved periodically on the chance that they might have changed. While this might be a large number of visits, it is much less than the number of Internet searches being performed.
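The efficiency argument can be made concrete with a small sketch. Below, a directory harvests each page from a site once and builds a word index over its own copy; every later search is answered from that index and never touches the site. The Site and CentralDirectory classes are assumptions made for illustration and do not describe how any particular directory service is built.

```python
from collections import defaultdict


class Site:
    """A web server that counts how often its pages are fetched."""

    def __init__(self, name, pages):
        self.name = name
        self.pages = pages      # {url: page text}
        self.fetches = 0

    def fetch(self, url):
        self.fetches += 1
        return self.pages[url]


class CentralDirectory:
    """Keeps an index built from its own copy of every harvested page."""

    def __init__(self):
        self.index = defaultdict(set)   # word -> set of URLs

    def harvest(self, site):
        # One visit per page; re-run periodically to pick up changes.
        for url in list(site.pages):
            for word in site.fetch(url).lower().split():
                self.index[word].add(url)

    def search(self, query):
        # Answered entirely from the directory's index.
        terms = query.lower().split()
        if not terms:
            return set()
        result = set(self.index.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.index.get(term, set())
        return result
```

However many searches are run, the load on the site is determined only by the number of pages and the frequency of re-harvesting.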

The greatest potential problem with this model is scalability. Essentially, these directory services maintain a copy of all the web pages on the Internet. So far, the cost of disk space has decreased and the speed of the processors has increased at a rate sufficient to make this searching practical. But, if we add the text surrogates for all the binary objects on the Internet to the existing collection of web pages, then we may strain the Internet directories beyond their capacity. This would almost certainly be the case if the complete binary object were being retrieved, instead of its text surrogate.

 

Distributed Searching

At the other extreme from centralized searching is distributed searching. This is a very inefficient model. The amount of network traffic that would be generated by all the searchers searching all the Internet sites is staggering. Nor would the processors at most of the sites be capable of supporting the searching load they would see, much less perform the work for which they were actually purchased. In addition, there is the problem of identifying all the sites so that they could be searched. This model of Internet searching can be discounted.

 

Hierarchical Discovery

In between the two extremes is a model where the searcher discovers sites to be searched directly. A simple example of this model was the WAIS Database of Databases, where WAIS sites registered descriptions of their databases at a central site. The searchers started their searches at the Database of Databases and discovered sites at which to repeat the searches. The weakness in this implementation is that the site discovery process was only as good as the site description.

This procedure could be improved by having each site provide a list of all the words from all the text objects and binary surrogates. This process is being called distributed indexing and there are initiatives underway to implement this.

With something like distributed indexing, individual sites can promote their rich site descriptions to regional servers. These regional servers are where the searchers would begin their Internet searches. The results of a search at a regional server would be pointers to sites that contained the records of interest.
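A minimal sketch of this two-step process is given below. Each site exports the list of words found in its records; a regional server records which sites hold which words and answers a query with site names only; the searcher then repeats the query at those sites. The LocalSite and RegionalServer classes and their methods are hypothetical names chosen for illustration, not a description of any existing distributed indexing initiative.

```python
class LocalSite:
    """A site holding text objects and surrogates for binary objects."""

    def __init__(self, name, records):
        self.name = name
        self.records = records   # {record id: text}

    def word_list(self):
        # The rich site description promoted to the regional server.
        return {w for text in self.records.values() for w in text.lower().split()}

    def search(self, query):
        terms = set(query.lower().split())
        return sorted(
            rid for rid, text in self.records.items()
            if terms <= set(text.lower().split())
        )


class RegionalServer:
    """Holds only word -> site pointers, never the records themselves."""

    def __init__(self):
        self.sites_by_word = {}

    def register(self, site):
        for word in site.word_list():
            self.sites_by_word.setdefault(word, set()).add(site.name)

    def candidate_sites(self, query):
        # Sites holding every query word somewhere. The words may be in
        # different records, so the searcher must repeat the query at the site.
        terms = query.lower().split()
        if not terms:
            return set()
        return set.intersection(*(self.sites_by_word.get(t, set()) for t in terms))
```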

Distributed indexing also eases some of the scaling problems of centralized searching. The central directories no longer need to keep copies of all the web pages on the Internet; only the descriptions of the pages as reflected in the indexes.

 

Conclusions

Text based searching of digital object surrogates would appear to be the most viable near-term solution to searching for digital objects. The success of this searching will depend on the development of metadata standards for these digital objects. Alternatives to the current Internet directories are possible and should be investigated.