Metadata Schema Transformation Services


Many researchers in the library community recognize the need to lower the barriers to the management of digital resources by implementing some measure of interoperability among metadata standards. They have proposed a wide range of solutions, including crosswalks, translation algorithms, metadata registries, and specialized data dictionaries. Yet despite some genuine advances, it is still difficult to identify the common elements in metadata standards and put this information to use in systems that resolve differences between incompatible records. The unfortunate result is a growing backlog of unassimilated resources that are all but inaccessible to users of digital information.

The goal of our research project is to develop data models and software tools that ease the task of translating between metadata standards, bridging the gap between analysis and implementation. Our work focuses on technical implementations of the crosswalk, the object that distinguishes metadata translation from more routine types of data and format conversion.

Two paths to interoperable metadata

Crosswalks are typically presented as tables of equivalences between two standards, such as MARC 245 $a and Dublin Core Title or ONIX and MARC 100 $a. Though the equivalences may be inexact, they represent an expert's judgment that the differences are immaterial to the successful operation of a software process that involves records encoded in the two standards. Unfortunately, this critical information isn't immediately usable in its native format because a crosswalk table is incomplete. It must be interpreted as a component in a complex object that makes reference to the formal specification of at least two metadata standards, where abstract concepts such as author, title, subject, and copyright are defined and given structural realization.

We have developed two implementations of the crosswalk, shown schematically in the figure below.

The first, called the Short Translation Path, translates a record from the native source format to the native target format with an XSLT stylesheet, moving directly from Step 1 to Step 5. But stylesheets are an incomplete representation of the crosswalk because they contain only structural information. To interpret as well as execute crosswalks requires up to six resources: the table of equivalences, the source metadata format, and the target metadata format, each of which may have a human-readable as well as a machine-processable encoding. In our proposal, the crosswalk is modelled as a METS object that collects pointers to these materials in a rich record that has many indexable fields. A database of METS crosswalk objects is then embedded in a repository or record delivery system that features lightweight XML processing.

In a typical process, the system examines an XML record to identify its format and match it against available translation options returned from a search of the METS database. The results of this operation are used to create RSS feeds, brief records, or record dumps in the user's desired format. The human-readable resources in the METS record serve to verify the meaning of the translation and aid in the identification of similar records. The Short Path works well when the standards involved in the crosswalks are relatively stable and translation options are limited, but it puts the burden on the systems designer who must assemble and maintain the supporting files.
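The lookup step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the in-memory REGISTRY list stands in for the database of METS crosswalk objects, and the CrosswalkEntry fields, file paths, and function names are hypothetical. The two namespace URIs are the published ones for MARCXML and OAI Dublin Core.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class CrosswalkEntry:
    """Hypothetical stand-in for one METS crosswalk object."""
    source_format: str
    target_format: str
    stylesheet: str          # pointer to the machine-processable XSLT
    equivalence_table: str   # pointer to the human-readable crosswalk table


# Root-element namespaces used to recognize a record's format.
FORMAT_BY_NAMESPACE = {
    "http://www.loc.gov/MARC21/slim": "MARCXML",
    "http://www.openarchives.org/OAI/2.0/oai_dc/": "oai_dc",
}

# Assumed registry contents; a real system would query the METS database.
REGISTRY = [
    CrosswalkEntry("MARCXML", "oai_dc", "xslt/marc_to_dc.xsl", "docs/marc_to_dc.html"),
    CrosswalkEntry("MARCXML", "MODS", "xslt/marc_to_mods.xsl", "docs/marc_to_mods.html"),
]


def identify_format(record_xml):
    """Return the format name for a record, or None if it is not recognized."""
    root = ET.fromstring(record_xml)
    # ElementTree renders a namespaced tag as "{namespace}localname".
    if root.tag.startswith("{"):
        namespace = root.tag[1:].split("}", 1)[0]
        return FORMAT_BY_NAMESPACE.get(namespace)
    return None


def translation_options(record_xml):
    """List the registry entries whose source format matches the record."""
    source = identify_format(record_xml)
    return [entry for entry in REGISTRY if entry.source_format == source]
```

Each returned entry carries the stylesheet pointer needed to run the translation, along with the human-readable table that lets a reader verify what the translation means.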

Our second proposal for formalizing the crosswalk is the Long Translation Path, represented by the additional Steps 2-4 in the figure. In this model, the division between syntax and semantics is more strictly enforced. As a result, the structure of the input record is first normalized to a generic XML container in Step 2. The translation logic encoded in the crosswalk is applied at Step 3, producing a syntactically normalized translation in Step 4, which is converted to the native output format in Step 5. Because low-level syntactic details such as element order and internal data formatting have been pushed to the outer layers of the model, the semantic translation layer at Step 3 emerges as the centerpiece.
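The separation of steps can be illustrated with a minimal Python sketch. It makes several simplifying assumptions: the input is an un-namespaced, MARC-like XML record; the generic container is a plain list of (name, value) pairs rather than XML; and the function names and the 100 $a to creator mapping are illustrative. It does not reproduce the project's actual software.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Illustrative crosswalk: source field -> target element.
CROSSWALK = {"245$a": "title", "100$a": "creator"}


def normalize(source_xml):
    """Step 2: reduce the native record to generic (name, value) pairs."""
    root = ET.fromstring(source_xml)
    pairs = []
    for field in root.findall("datafield"):
        for sub in field.findall("subfield"):
            name = f"{field.get('tag')}${sub.get('code')}"
            pairs.append((name, (sub.text or "").strip()))
    return pairs


def translate(pairs, crosswalk):
    """Step 3: apply the crosswalk; the returned pairs are the Step 4 result."""
    return [(crosswalk[name], value) for name, value in pairs if name in crosswalk]


def serialize(pairs):
    """Step 5: write the normalized translation in a native output syntax."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for name, value in pairs:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(root, encoding="unicode")
```

Only normalize and serialize know anything about XML syntax. The translate function sees only names and values, which is why the same crosswalk could be reused with a different reader or writer.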

The translation logic is executed by a dedicated XML application called the Semantic Equivalence Expression Language, or Seel, which transcribes the information in a crosswalk into an executable format. Like a crosswalk, a Seel script is organized as a sequence of independent modules, or maps, each implementing a single line in a translation. Because of the close correspondence between a Seel script and a crosswalk, Seel code can be automatically generated from structured tabular input such as an Excel spreadsheet. And because of the modular design of the Long Path, each step of the processing model is optimized for reuse.
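The one-row-per-map correspondence is easy to picture in code. The sketch below reads a small CSV table (standing in for a spreadsheet export) and emits one independent map per row. The column names and the dictionary representation are assumptions made for illustration; this is not Seel syntax, which is not reproduced here.

```python
import csv
import io

# Assumed tabular crosswalk: one equivalence per row.
TABLE = """source,target
245$a,title
100$a,creator
"""


def maps_from_table(text):
    """Turn each crosswalk row into one independent map."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {"source": row["source"].strip(), "target": row["target"].strip()}
        for row in reader
    ]


maps = maps_from_table(TABLE)
```

Because each map depends only on its own row, adding or revising an equivalence in the table changes exactly one map in the generated script.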

For example, the same Seel script can be used for multiple syntactic encodings such as RDF, XML, or structured non-XML. New versions of standards can be implemented by assembling a translation that consists of new maps introduced by the revision, plus the maps from the previous version that were unchanged. In future work, the Seel map will be used as a template for an entry in a database that can be searched and mined. The Long Path model promises to speed the transition from analysis to application, while minimizing the impact on the metadata experts who create the crosswalks. The defining object is still abstract, but behind the scenes the model supplies the messy details required to execute the translation.
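Assembling a translation for a revised standard can be sketched as a merge of map sets. The example assumes that maps are keyed by their source element and that a revised map replaces the earlier one with the same key. The function name and the sample mappings are hypothetical.

```python
def assemble_version(previous_maps, revision_maps):
    """Combine unchanged maps from the previous version with the revision's maps.

    Both arguments are dicts keyed by source element. A map in revision_maps
    replaces any previous map with the same key; all others carry over.
    """
    assembled = dict(previous_maps)   # copy, so the old version stays intact
    assembled.update(revision_maps)
    return assembled


# Illustrative data only.
version_1 = {"245$a": "title", "100$a": "creator"}
revision = {"100$a": "contributor", "520$a": "description"}
version_2 = assemble_version(version_1, revision)
```

The previous version is left untouched, so both versions of the translation remain available while records in the older format are still in circulation.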

In sum, our research project has developed two computational models for metadata crosswalks. Though this outcome reflects a natural evolution of our thinking, we believe that the models address different needs. The Short Path works for the best-case scenario where the data is in XML and the number of crosswalks is constrained. The Long Path is optimized for scalability and change management. Both models promote reuse and reduce ad hoc processing, thereby advancing the goal of interoperability in digital libraries.


The Short Translation Path


Bibliography on metadata standards and interoperability