Pears Database Functional Description
Background and Scope
This document describes some of the functionality anticipated by the reimplementation of the Newton database building programs. This new system is named Pears, to distinguish it from the apples used for names in the Newton system.
The Pears system is being built to satisfy database requirements within the Office of Research. Specifically, it is intended as a replacement for the SMART system used for Scorpion. SMART provides several ranked retrieval algorithms, something that is only available in a limited form in Newton. The license for SMART prohibits its use in commercial applications. Pears is currently being used extensively throughout the Office of Research. It is also being used for some of the databases in the CORC system and the CORC designers are working to replace all their Newton databases with Pears databases. It is hoped that Pears will become part of the SiteSearch product in mid to late 1999.
Pears databases are not usable by the production Newton retrieval engine written in C. They can only be used by the new Java-based information retrieval environment (the Light engine) developed by Thom Hickey and Jenny Colvard and derived from the Newton Lite engine developed by Thom Hickey and Bob Haschart. The Light engine has been designed with a dynamic database interface that allows it to search different kinds of databases, including traditional Newton databases and Pears databases.
Development Environment Rationale
The Pears database building programs are being written in Java. While Java programs do not necessarily run as fast as native code, their reliability makes the performance penalty worthwhile. In addition, some of the functionality available in Java, such as threads and native Unicode support, will make it possible to do things in Java that weren’t feasible in C or C++. (As of 8/12/98, the Pears code loaded USMARC records at one quarter the speed of the C code on a Solaris machine. Multithreading will definitely reduce that speed differential.)
One feature of Java of particular interest is its support for dynamically loaded routines. Most other languages support dynamically loaded code, C among them. The SiteSearch developers have resisted making this capability available to their customers because of the potential unreliability that can result from allowing customers to introduce their own code into production programs. The inherent reliability of the Java environment significantly reduces the potential impact of unreliable customer code, and the new SiteSearch Java development makes extensive use of dynamic code. The Pears database building programs will also provide significant opportunities for users to customize the database building programs with their own code. (As of 4/22/99, this means indexing, record conversion, record filtering and character set conversion routines.)
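As an illustration of dynamically loaded routines, the following is a minimal sketch of loading a customer-supplied indexing routine by class name using standard Java reflection. The names IndexRule, WordIndexRule and RuleLoader are hypothetical and are not taken from the Pears code; the actual Pears plug-in interfaces are not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical plug-in contract; the real Pears interface is not shown in this document. */
interface IndexRule {
    List<String> extractTerms(String fieldValue);
}

/** A sample customer-supplied rule: lowercase the field and split it on whitespace. */
class WordIndexRule implements IndexRule {
    @Override
    public List<String> extractTerms(String fieldValue) {
        List<String> terms = new ArrayList<>();
        for (String word : fieldValue.toLowerCase().trim().split("\\s+")) {
            if (!word.isEmpty()) {
                terms.add(word);
            }
        }
        return terms;
    }
}

final class RuleLoader {
    /** Loads a rule by class name, much as a database description might name it. */
    static IndexRule load(String className) {
        try {
            Class<?> cls = Class.forName(className);
            // asSubclass rejects classes that do not implement the contract.
            return cls.asSubclass(IndexRule.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Cannot load index rule: " + className, e);
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : WordIndexRule.class.getName();
        IndexRule rule = load(name);
        System.out.println(rule.extractTerms("Pears Database Functional Description"));
    }
}
```

Because the routine is named by a string, a different routine can be substituted without recompiling the program that calls it. A class that fails to load or does not implement the contract is rejected at load time.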
Database Architecture
Operationally, Pears databases are identical to Newton databases in that they still logically consist of Header, Index and Postings (hedr, indx and post) files with Header and Postings Indirection (hdir and pdir) files. However, these logical files have been combined into one physical file and the hdir and pdir files have been combined into one logical Indirection (idir) file. The file still consists of a sequence of fixed length physical regions accessed by relative region number. (This is the BDAM file architecture from OS/MVS.) The differences lie in the contents of those files.
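Access to fixed length regions by relative region number can be sketched as follows. This is an illustration only: the class name RegionFile, the 512 byte region size and the method names are assumptions, and the actual Pears region layout is not described here.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a file made of fixed length regions. */
final class RegionFile implements AutoCloseable {
    private final RandomAccessFile file;
    private final int regionSize;

    RegionFile(File path, int regionSize) throws IOException {
        this.file = new RandomAccessFile(path, "rw");
        this.regionSize = regionSize;
    }

    /** Byte offset of a region: relative region number times the fixed region length. */
    long offsetOf(long regionNumber) {
        return regionNumber * regionSize;
    }

    /** Writes one region, padding the data with zero bytes up to the region length. */
    void writeRegion(long regionNumber, byte[] data) throws IOException {
        if (data.length > regionSize) {
            throw new IllegalArgumentException("data exceeds region size");
        }
        byte[] region = new byte[regionSize];
        System.arraycopy(data, 0, region, 0, data.length);
        file.seek(offsetOf(regionNumber));
        file.write(region);
    }

    /** Reads one full region. */
    byte[] readRegion(long regionNumber) throws IOException {
        byte[] region = new byte[regionSize];
        file.seek(offsetOf(regionNumber));
        file.readFully(region);
        return region;
    }

    /** Number of regions currently covered by the file. */
    long regionCount() throws IOException {
        return file.length() / regionSize;
    }

    @Override
    public void close() throws IOException {
        file.close();
    }

    public static void main(String[] args) throws IOException {
        File path = File.createTempFile("regions", ".db");
        path.deleteOnExit();
        try (RegionFile db = new RegionFile(path, 512)) {
            db.writeRegion(3, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println("regions: " + db.regionCount());
            System.out.println(new String(db.readRegion(3), 0, 5, StandardCharsets.UTF_8));
        }
    }
}
```

Because every region has the same length, locating a region needs only a multiplication and a seek, with no scan of the file.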
Internally, about the only things common between Newton and Pears databases are the BER records in the Header file. Postings files still have postings lists and Index files still have B-trees, but they differ from their Newton equivalents in structure.
Functionality and Features
Functionally, Pears databases will be able to do everything that Newton databases do, with the exception of supporting gigantic records through automatic decomposition and the creation of "boot" records. I am not thrilled with that model for handling large records and I am considering a model based on some of the collection management functionality being discussed in the Z39.50 implementor community.
Pears databases will provide significant new features. In no particular order, they are:
- Ranked Retrieval. Pears databases will keep track of the frequency of occurrence of terms in documents as well as within the database and will allow documents to be ranked based on the significance of the user’s search terms within the documents retrieved. Several ranking algorithms are currently used within the SMART system and will need to be supported, at least initially, by the Pears databases. To support this and to enable experimentation with a number of ranking algorithms, the ranking routines will be dynamically loadable. (Done, as of 4/22/99.)
- Online Update. We believe we have found a mechanism for supporting efficient online update. This mechanism has been implemented in Newton within the Office of Research and seems to be working well. The code is in place in Pears, but has not seen extensive use yet.
- The five Newton database files combined into one Pears database file. (Done, as of 8/12/98.)
- The five Newton database update programs/steps (initdb, record conversion, pippin, sortnip and rome) combined into a single program (Bartlett). (Done, as of 8/12/98.)
- Fail-Safe. A Pears database remains intact, even if Bartlett ends abnormally. (Done, as of 4/22/99.)
- Embeddability. Routines have been provided to allow applications to add, replace or remove single records from a Pears database, as opposed to using Bartlett in batch mode. (Done, as of 4/22/99.)
- Internationalization. Records in the Unicode character set (the international standard for supporting multiple character sets) can be loaded and indexed. All index terms are effectively, but not literally, stored in Unicode. (As of 4/22/99, USMARC to Unicode and OCLC-ASCII to Unicode record converters have been written.)
- Automatic Database Growth. The BDAM model implemented in the Newton databases forced them to be preconfigured at a specific size and made it awkward to make them bigger when they filled up. The new databases will only use as much disk space as they need and can grow dynamically. It will still be possible to specify a maximum size for a database file. (As of 4/22/99, the automatic database growth has been implemented, but the size limit has not.)
- Dynamic Record Conversion. It will be possible to specify in the database description the names of routines that can convert records from their native format to BER. These routines will be used by Bartlett (the Pears database update program) to read the records in their native format, thus eliminating the BER conversion step currently needed to build databases. (Done, as of 8/12/98.)
- Record Filters. During record conversion, a dynamically loaded filter can eliminate records from being loaded. This technique can be used to partition databases or to extract particular records from a data stream. (Done, as of 4/22/99.)
- Dynamic Indexing Rules. Customers will be able to provide their own code for indexing fields in records. (Done, as of 8/12/98.)
- Non-tagpath Indexing Rules. Indexing rules can be invoked independent of record structure. They can be used to clean up indexing from other routines (stopword enforcement, for instance) or to browse entire records for indexes that do not have an obvious field dependency. (Done, as of 4/22/99.)
- Reindexing. Entire indexes can be removed and records already stored in the database can be reindexed. (Done, as of 8/25/98.)
- Context-free Tag Paths. Indexing rules for Newton databases require that the database designer provide complete paths to fields in records. Unfortunately, SGML records usually have so much potential variability in internal structure that a complete list of all the places that a field might occur in a record is not feasible. Pears users will be able to provide tag paths with wildcard characters. (Done, as of 8/12/98.)
- Unlimited Number of Indexes. The current limit of 255 indexes is removed. (Done, as of 8/12/98.)
- .ini Files For Database Descriptions. The old Newton database description language has been dropped in favor of a Windows-style initialization file. (As of 8/12/98, all implemented configurable features are specified through a .ini file.)
- Records can be deleted from the database by specifying a file containing a list of Accession Numbers (an indexed field in the records) or Record IDs (generated internally by the database) for the records to be deleted.
- Index Sorted Primarily By Index ID. Newton indexes are sorted primarily by index term and secondarily by the term's index. This allows terms in other indexes to be viewed when browsing indexes. This browsing feature was little used and resulted in terms from sparse indexes being widely scattered in the index, making it hard to browse those indexes. Pears indexes are sorted primarily by index ID, so the terms of each index are kept together. (Done, as of 8/12/98.)
- Left-hand Truncation. (Done, as of 8/23/99.)
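The Ranked Retrieval item above combines two ideas: term frequencies tracked per document and per database, and ranking routines that can be swapped. A minimal sketch of that combination follows. It uses plain tf-idf weighting only as a familiar stand-in; it is not one of the SMART algorithms, and the names Ranker, TfIdfRanker and RankedSearch are hypothetical rather than taken from Pears.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical contract for a swappable ranking routine. */
interface Ranker {
    double weight(int termFreqInDoc, int docsContainingTerm, int totalDocs);
}

/** Plain tf-idf weighting, used here only as a familiar stand-in. */
class TfIdfRanker implements Ranker {
    @Override
    public double weight(int termFreqInDoc, int docsContainingTerm, int totalDocs) {
        return termFreqInDoc * Math.log((double) totalDocs / docsContainingTerm);
    }
}

final class RankedSearch {
    /** Counts how often each whitespace-separated term occurs in one document. */
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String term : text.toLowerCase().trim().split("\\s+")) {
            if (!term.isEmpty()) {
                counts.merge(term, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the ids (list positions) of matching documents, best score first. */
    static List<Integer> rank(List<Map<String, Integer>> docs, List<String> query, Ranker ranker) {
        Map<Integer, Double> scores = new HashMap<>();
        for (String term : query) {
            // Database-wide frequency: how many documents contain the term.
            int docsContainingTerm = 0;
            for (Map<String, Integer> doc : docs) {
                if (doc.containsKey(term)) {
                    docsContainingTerm++;
                }
            }
            for (int id = 0; id < docs.size(); id++) {
                Integer tf = docs.get(id).get(term);
                if (tf != null) {
                    scores.merge(id, ranker.weight(tf, docsContainingTerm, docs.size()), Double::sum);
                }
            }
        }
        List<Integer> ranked = new ArrayList<>(scores.keySet());
        // Higher scores first; ties fall back to the lower document id.
        ranked.sort((a, b) -> {
            int byScore = Double.compare(scores.get(b), scores.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        return ranked;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> docs = new ArrayList<>();
        docs.add(termFrequencies("pear pear pear apple"));
        docs.add(termFrequencies("pear"));
        docs.add(termFrequencies("apple apple"));
        docs.add(termFrequencies("newton"));
        System.out.println(rank(docs, Arrays.asList("pear", "apple"), new TfIdfRanker()));
    }
}
```

Passing the Ranker as a parameter is what makes experimentation cheap: a different weighting scheme only needs a different implementation of the one method.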
Significant Newton Functionality and Features
It is probably worthwhile to list those important features of Newton that are being carried forward into Pears. Unless otherwise noted, they are already implemented in Pears. In no particular order, they are:
- Universal Database Portability. A Pears database can be used on any machine, regardless of the machine that it was built on.
- BER Records As Internal Storage Format. Arbitrarily large records with arbitrarily complex hierarchical structure containing both text and binary data are supported in BER records (ISO Standard 8825).
- Two Billion Records Per Database. Using partitioning at the Z39.50 layer, logical databases can be constructed of multiple physical databases, allowing the logical database to be arbitrarily large.
- Two Billion Postings Per Term.
- Practically Unlimited Term Length. Terms with lengths greater than two thousand characters are discouraged.
- Record Restrictors. These can be used to significantly speed up some kinds of searches. They allow information about a record (e.g., its date or language) to be associated with all terms extracted from that record. The extraction of these restrictor values will be implemented as dynamically loadable routines.
- Proximity Information. The position of a term in a record can be used to ask questions about that term's proximity to other terms. Pears will support word and field proximity and will probably support dynamically loadable routines to generate other per-occurrence data.
- Delete and Replace Records.
- Stopwords. In Newton databases, the stopwords were applied to all indexes. Pears supports index-specific stopwords.
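The Proximity Information item above relies on recording the position of each term occurrence in a record. A minimal sketch of word proximity built on such positions follows. The class and method names are hypothetical, and the way Pears actually stores per-occurrence data is not described in this document.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class WordProximity {
    /** Records the 0-based word positions of each term in one record's text. */
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new HashMap<>();
        int position = 0;
        for (String word : text.toLowerCase().trim().split("\\s+")) {
            if (word.isEmpty()) {
                continue;
            }
            result.computeIfAbsent(word, k -> new ArrayList<>()).add(position);
            position++;
        }
        return result;
    }

    /** True if some occurrence of a lies within maxDistance words of some occurrence of b. */
    static boolean within(Map<String, List<Integer>> pos, String a, String b, int maxDistance) {
        List<Integer> pa = pos.getOrDefault(a, Collections.emptyList());
        List<Integer> pb = pos.getOrDefault(b, Collections.emptyList());
        // Both lists are in ascending order, so advancing the smaller position
        // each time is enough to find the closest pair.
        int i = 0;
        int j = 0;
        while (i < pa.size() && j < pb.size()) {
            int distance = pa.get(i) - pb.get(j);
            if (Math.abs(distance) <= maxDistance) {
                return true;
            }
            if (distance < 0) {
                i++;
            } else {
                j++;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Map<String, List<Integer>> pos = positions("the quick brown fox jumps over the lazy dog");
        System.out.println(within(pos, "quick", "fox", 2));
        System.out.println(within(pos, "quick", "dog", 3));
    }
}
```

Field proximity could be handled the same way by recording a field number instead of a word number for each occurrence.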
Miscellaneous Technical Improvements
In addition, it is probably worthwhile to list those improvements in the Pears system that don't result in user-visible features but enhance the functionality of the system. In no particular order, they are:
- Binary Searchable Index Nodes. In Newton, index nodes had to be searched sequentially. The Pears system supports a directory on each node that allows the use of a binary search routine. This directory also allows terms to be added more easily, as they can be added at the beginning of the freespace area, rather than having to be inserted alphabetically into a list of terms. (Done, as of 8/12/98.)
- Etcetera Regions Eliminated. In Newton, postings lists are broken into fragments and a map of the fragments is maintained in the index. This map had a maximum length and some extremely long postings lists had fragments not covered by the map (these were called etcetera regions). Arbitrarily long maps can now be created and stored in the postings file. (Done, as of 8/12/98.)
- Short Postings Lists Kept In The Index. In a Newton database, the postings data for singly posted terms was kept in the index and all other postings lists went into the postings file. In a Pears database, longer lists can be kept in the index and the maximum length of those lists is configurable. (As of 8/12/98, this is complete, except for the configurability of the length.)
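The Binary Searchable Index Nodes item above can be pictured with a small in-memory model: terms are appended in arrival order, and a separate directory of slot numbers is kept in term order so that it can be binary searched. This is a sketch under those assumptions. The class name IndexNode and its layout are hypothetical, and a real node would hold bytes in a fixed length region rather than Java lists.

```java
import java.util.ArrayList;
import java.util.List;

final class IndexNode {
    /** Terms in arrival order; stands in for the node's data area. */
    private final List<String> termArea = new ArrayList<>();
    /** Slot numbers into termArea, kept in ascending term order. */
    private final List<Integer> directory = new ArrayList<>();

    /** Binary search over the directory; returns its position or -(insertionPoint) - 1. */
    int find(String term) {
        int lo = 0;
        int hi = directory.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int cmp = termArea.get(directory.get(mid)).compareTo(term);
            if (cmp == 0) {
                return mid;
            }
            if (cmp < 0) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -(lo + 1);
    }

    /** Appends the term at the start of free space; only the directory is reordered. */
    boolean add(String term) {
        int pos = find(term);
        if (pos >= 0) {
            return false; // term already present
        }
        termArea.add(term);
        directory.add(-(pos + 1), termArea.size() - 1);
        return true;
    }

    boolean contains(String term) {
        return find(term) >= 0;
    }

    /** Terms in sorted order, read through the directory. */
    List<String> termsInOrder() {
        List<String> ordered = new ArrayList<>();
        for (int slot : directory) {
            ordered.add(termArea.get(slot));
        }
        return ordered;
    }

    /** Terms in the order they were physically appended. */
    List<String> termsAsStored() {
        return new ArrayList<>(termArea);
    }

    public static void main(String[] args) {
        IndexNode node = new IndexNode();
        for (String term : new String[] {"pear", "apple", "quince"}) {
            node.add(term);
        }
        System.out.println(node.termsAsStored());
        System.out.println(node.termsInOrder());
    }
}
```

Adding a term moves only small directory entries, never the stored terms themselves, which is why insertion is cheaper than maintaining an alphabetical list of terms.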