Pears Database Functional Description


Background and Scope

This document describes some of the functionality anticipated for the reimplementation of the Newton database building programs. The new system is named Pears, to distinguish it from the apples used for names in the Newton system.

The Pears system is being built to satisfy database requirements within the Office of Research, specifically as a replacement for the SMART system used for Scorpion. SMART provides several ranked retrieval algorithms, something that is available only in a limited form in Newton. However, the license for SMART prohibits its use in commercial applications. Pears is currently being used extensively throughout the Office of Research. It is also being used for some of the databases in the CORC system, and the CORC designers are working to replace all their Newton databases with Pears databases. It is hoped that Pears will become part of the SiteSearch product in mid-to-late 1999.

Pears databases are not usable by the production Newton retrieval engine, which is written in C. They can only be used by the new Java-based information retrieval environment (the Light engine) developed by Thom Hickey and Jenny Colvard, which is derived from the Newton Lite engine developed by Thom Hickey and Bob Haschart. The Light engine has been designed with a dynamic database interface that allows it to search different kinds of databases, including traditional Newton databases and Pears databases.


Development Environment Rationale

The Pears database building programs are being written in Java. While Java programs do not necessarily run as fast as native code, their reliability makes the performance penalty worthwhile. In addition, some of the functionality available in Java, such as threads and native Unicode support, will make it possible to do things in Java that weren't feasible in C or C++. (As of 8/12/98, the Pears code loaded USMARC records at one quarter the speed of the C code on a Solaris machine. Multithreading will definitely reduce that speed differential.)

One feature of Java of particular interest is its support for dynamically loaded routines. Many other languages, C among them, also support dynamically loaded code. The SiteSearch developers have resisted making this capability available to their customers because of the potential unreliability that can result from allowing customers to introduce their own code into production programs. The inherent reliability of the Java environment significantly reduces the potential impact of unreliable customer code, and the new SiteSearch Java development makes extensive use of dynamic code. The Pears database building programs will also provide significant opportunities for users to customize the database building process with their own code. (As of 4/22/99, this means indexing, record conversion, record filtering and character set conversion routines.)
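
To make the idea of dynamically loaded customer code concrete, here is a minimal sketch of how a build program might load an indexing routine by class name. The names Indexer, WordIndexer and IndexerLoader are hypothetical and are not taken from the Pears source; the sketch also uses current Java syntax rather than the Java of 1999. It shows only the general technique (Class.forName followed by instantiation through a known interface).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical plug-in contract; the actual Pears interfaces are not
// described in this document.
interface Indexer {
    List<String> extractTerms(String fieldValue);
}

// Example of a routine a customer might supply: lowercases the field
// and splits it into words on anything that is not a letter or digit.
class WordIndexer implements Indexer {
    public List<String> extractTerms(String fieldValue) {
        List<String> terms = new ArrayList<>();
        String lowered = fieldValue.toLowerCase(Locale.ROOT);
        for (String term : lowered.split("[^\\p{L}\\p{Nd}]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }
}

class IndexerLoader {
    // Loads an Indexer given only its class name, as a build program
    // might do after reading the name from a configuration file.
    static Indexer load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            return (Indexer) cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // A bad customer class is reported rather than crashing the build.
            throw new IllegalArgumentException("Cannot load indexer " + className, e);
        }
    }
}
```

A caller would write something like IndexerLoader.load("WordIndexer") and then use the result only through the Indexer interface, so the build program never needs to be recompiled when a customer supplies a new routine.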


Database Architecture

Operationally, Pears databases are architecturally identical to Newton databases in that they still logically consist of Header, Index and Postings (hedr, indx and post) files with Header and Postings Indirection (hdir and pdir) files. However, these logical files have been combined into one physical file and the hdir and pdir files have been combined into one logical Indirection (idir) file. The file still consists of a sequence of fixed length physical regions accessed by relative region number. (This is the BDAM file architecture from OS/MVS.) The differences lie in the contents of those files.
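
Access by relative region number can be sketched as follows. This is an illustration only: the class name RegionFile, the 0-based region numbering and the region size passed by the caller are assumptions, since this document does not state the actual Pears region size or numbering convention.

```java
import java.io.IOException;
import java.io.RandomAccessFile;

class RegionFile {
    // Byte offset of a region, assuming regions are numbered from 0
    // and all have the same fixed length.
    static long regionOffset(long regionNumber, int regionSize) {
        if (regionNumber < 0 || regionSize <= 0) {
            throw new IllegalArgumentException("invalid region number or size");
        }
        return regionNumber * regionSize;
    }

    // Reads one whole region from the single physical database file.
    static byte[] readRegion(RandomAccessFile file, long regionNumber, int regionSize)
            throws IOException {
        byte[] region = new byte[regionSize];
        file.seek(regionOffset(regionNumber, regionSize));
        file.readFully(region); // throws EOFException if the region is incomplete
        return region;
    }
}
```

Because every region has the same length, locating a region is a single multiplication and seek, with no need to scan the file.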

Internally, about the only things common between Newton and Pears databases are the BER records in the Header file. Postings files still have postings lists and Index files still have B-trees, but they differ from their Newton equivalents in structure.
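
For readers unfamiliar with BER, each BER element carries its own length, which is what lets a reader step through a Header record without outside information. The sketch below decodes a BER definite-form length as defined by the BER standard; the class name BerLength is hypothetical and the code is not taken from Newton or Pears. It handles lengths of up to four octets and rejects the indefinite form.

```java
class BerLength {
    // Returns the content length whose encoding starts at buf[pos].
    static int decodeLength(byte[] buf, int pos) {
        int first = buf[pos] & 0xFF;
        if (first < 0x80) {
            return first; // short form: the octet is the length itself
        }
        int count = first & 0x7F; // long form: number of length octets that follow
        if (count == 0) {
            throw new IllegalArgumentException("indefinite length not handled in this sketch");
        }
        if (count > 4) {
            throw new IllegalArgumentException("length too large for this sketch");
        }
        int length = 0;
        for (int i = 1; i <= count; i++) {
            length = (length << 8) | (buf[pos + i] & 0xFF);
        }
        if (length < 0) {
            throw new IllegalArgumentException("length exceeds the int range");
        }
        return length;
    }
}
```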


Functionality and Features

Functionally, Pears databases will be able to do everything that Newton databases do, with the exception of supporting gigantic records through automatic decomposition and the creation of "boot" records. I am not thrilled with that model for handling large records and I am considering a model based on some of the collection management functionality being discussed in the Z39.50 implementor community.

Pears databases will provide significant new features. In no particular order, they are:

Significant Newton Functionality and Features

It is probably worthwhile to list those important features of Newton that are being carried forward into Pears. Unless otherwise noted, they are already implemented in Pears. In no particular order, they are:

Miscellaneous Technical Improvements

In addition, it is probably worthwhile to list those improvements in the Pears system that don't result in user-apparent features but do enhance the functionality of the system. In no particular order, they are: