Building Databases
Cheshire3 takes care of two main processes. The one that concerns us here is the Ingestion Phase: how to build a database from your data.
First we will look at the types of object involved in this phase, then at how to configure them. Once the objects in the system are ready, there are two possible ways to tell them how to work together, and we'll cover each in turn.
Before we can populate a database with data, we need to configure how everything is going to work. That means knowing which sorts of Objects we will interact with, and which of them therefore need to be configured. To find out, we'll run through a fairly typical ingestion phase process in Cheshire3.
A single item of data from outside of the system is represented in Cheshire3 as a Document. This could be raw text, XML, a PDF, a row in a relational table or any other single coherent item. Normally, you will want to import a whole lot of items at once, so the first object that we'll need to create is a DocumentGroup. These come in various flavours, but the most commonly used one is the DirectoryDocumentGroup, which looks for all of the things that can be imported in a directory tree on disk.
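As a rough illustration of what a directory-based document group does, the sketch below walks a directory tree and yields one document per file. It uses only the Python standard library; `SketchDocument` and `iter_directory_documents` are hypothetical names invented for this illustration and are not Cheshire3's API.

```python
import os
from dataclasses import dataclass
from typing import Iterator


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document: one item of raw data."""
    filename: str
    data: bytes


def iter_directory_documents(root: str) -> Iterator[SketchDocument]:
    """Yield one document per file found anywhere under `root`."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a deterministic order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                yield SketchDocument(filename=path, data=handle.read())
```

Yielding documents lazily means a large tree never has to be held in memory all at once; each document can be processed and discarded in turn.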
Once we have the DocumentGroup, we can extract each document from it in turn for processing. If the document is not well-formed XML, we need to transform it so that the XML parser will accept it. To do this, we use a series of PreParsers. A preParser accepts a document, performs a transformation, and returns the result as another document. Examples might be a PDF to text preParser, or one that turns latin-1 entities like &eacute; into their Unicode character entity equivalent. A chain of preParsers can then transform a document through any number of intermediate steps before arriving at the raw XML form.
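The document-in, document-out contract is what makes preParsers chainable. The following is a minimal sketch of that idea, not Cheshire3's own classes: `SketchDocument`, `named_entities_to_numeric`, `wrap_in_root` and `run_chain` are all hypothetical names. The entity table comes from the standard library's `html.entities` module.

```python
import re
from dataclasses import dataclass
from html.entities import name2codepoint


@dataclass(frozen=True)
class SketchDocument:
    """Hypothetical stand-in for a Document holding text."""
    text: str


def named_entities_to_numeric(doc):
    """Rewrite named entities (e.g. &eacute;) as numeric character
    references, leaving XML's five built-in entities untouched."""
    xml_builtin = {"amp", "lt", "gt", "quot", "apos"}

    def replace(match):
        name = match.group(1)
        if name in xml_builtin or name not in name2codepoint:
            return match.group(0)  # leave unknown or built-in entities alone
        return "&#%d;" % name2codepoint[name]

    return SketchDocument(re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, doc.text))


def wrap_in_root(doc):
    """Wrap a fragment in a single root element (assumed name: doc)."""
    return SketchDocument("<doc>%s</doc>" % doc.text)


def run_chain(doc, chain):
    """Apply each preParser in order, feeding each result to the next."""
    for pre_parser in chain:
        doc = pre_parser(doc)
    return doc


result = run_chain(
    SketchDocument("caf&eacute; &amp; bar"),
    [named_entities_to_numeric, wrap_in_root],
)
```

Because every step has the same shape, steps can be reordered or added without changing the code that runs the chain.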
When the document is in its final XML, we run it through a Parser in order to create a Record representation of it. This record is the parsed XML, and allows interaction via both the Document Object Model and SAX interfaces, as well as serialising back to the canonical string form.
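To make the three kinds of access concrete, here is a small sketch using Python's standard `xml.dom.minidom` and `xml.sax` modules rather than Cheshire3's Record class. The sample XML and the `ElementNameCollector` handler are assumptions made for the illustration.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = "<record><title>Cheshire3</title><p>Building databases</p></record>"

# DOM view: the whole tree is in memory and can be navigated freely.
dom = minidom.parseString(XML_TEXT)
title = dom.getElementsByTagName("title")[0].firstChild.data


class ElementNameCollector(xml.sax.ContentHandler):
    """SAX view: the same XML seen as a stream of events."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


collector = ElementNameCollector()
xml.sax.parseString(XML_TEXT.encode("utf-8"), collector)

# Serialising the parsed tree back to a string.
serialised = dom.documentElement.toxml()
```

The DOM form suits random access to parts of the record, while the event stream suits a single pass over large records.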
Before we can talk about the record, we need to assign it a persistent identifier. This is typically done by storing the record in a RecordStore, although identifiers can also be assigned manually. Documents may also be stored in a DocumentStore, which is especially useful when the preParser chain is 'destructive', i.e. when information is lost, such as in a PDF to raw text conversion.
The record is then given to a Database. The database notes that the record is part of its collection and then hands it off to one or more Indexes to extract the terms to allow for efficient searching.
The index then extracts the relevant data from the record via one or more XPath expressions configured ahead of time. Once it has the result, it gives the information to an Extracter which will extract from it the required data in the required format. We call these 'terms' but they do not have their own object class for performance reasons. For example, if given the contents of a paragraph level element, a KeywordExtracter would extract each word whereas a DateExtracter would look only for dates to extract.
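The two-step split between selecting data and extracting terms can be sketched as follows. This uses the limited XPath support in the standard library's `xml.etree.ElementTree`, and the functions `select_text`, `extract_keywords` and `extract_dates` are hypothetical stand-ins for the behaviour described above, not Cheshire3's extracter classes. The sample record and the ISO date format are assumptions.

```python
import re
import xml.etree.ElementTree as ET

RECORD_XML = """<record>
  <p>The meeting is on 2005-06-01.</p>
  <p>Indexing is fast.</p>
</record>"""


def select_text(record_xml, path):
    """Step 1: pick out the text of every element matching `path`."""
    root = ET.fromstring(record_xml)
    return ["".join(element.itertext()) for element in root.findall(path)]


def extract_keywords(text):
    """Step 2a: treat each run of word characters as a term."""
    return re.findall(r"\w+", text)


def extract_dates(text):
    """Step 2b: keep only ISO-style dates (YYYY-MM-DD), ignore the rest."""
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)


paragraphs = select_text(RECORD_XML, ".//p")
keywords = extract_keywords(paragraphs[1])
dates = [date for text in paragraphs for date in extract_dates(text)]
```

The same selected text produces quite different terms depending on which extraction function is applied, which is why the choice of extracter is part of each index's configuration.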
Once the terms have been extracted, the index may run through a series of Normalisers. Each normaliser will take each term and transform it in some fashion. For example a CaseNormaliser might return the term in all lowercase, or a StemNormaliser might perform linguistic stemming.
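A normaliser chain can be pictured as a list of term-to-term functions applied in order. The sketch below is illustrative only: `lowercase`, `strip_plural_s` and `normalise` are hypothetical names, and the suffix stripper is a deliberately crude stand-in for real linguistic stemming.

```python
def lowercase(term):
    """Fold the term to lower case."""
    return term.lower()


def strip_plural_s(term):
    """Crude stand-in for stemming: drop one trailing 's' from longer
    words, but leave words ending in 'ss' (e.g. 'class') alone."""
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def normalise(terms, chain):
    """Pass every term through each normaliser in the chain, in order."""
    normalised = []
    for term in terms:
        for normaliser in chain:
            term = normaliser(term)
        normalised.append(term)
    return normalised


terms = normalise(["Databases", "Index", "Class"], [lowercase, strip_plural_s])
```

Order matters: a normaliser that matches lowercase suffixes has to run after the case folding step.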
After the normalisation process, the terms are stored in an IndexStore, each along with a pointer to the record from which it was derived and the number of occurrences of the term in that record.
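The stored structure is essentially an inverted index: for each term, the records that contain it and how often. A minimal in-memory sketch, assuming string record identifiers and a hypothetical `SketchIndexStore` class (Cheshire3's own stores are persistent and are not shown here):

```python
from collections import Counter, defaultdict


class SketchIndexStore:
    """Hypothetical in-memory store mapping term -> {record id: count}."""

    def __init__(self):
        self._postings = defaultdict(dict)

    def store_terms(self, record_id, terms):
        """Record how many times each term occurs in the given record."""
        for term, count in Counter(terms).items():
            self._postings[term][record_id] = count

    def lookup(self, term):
        """Return a copy of {record id: occurrences} for the term."""
        return dict(self._postings.get(term, {}))


store = SketchIndexStore()
store.store_terms("rec-1", ["index", "term", "index"])
store.store_terms("rec-2", ["term"])
```

Looking a term up then costs a single dictionary access rather than a scan over every record, which is what makes searching efficient.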
That's the ingestion phase from beginning to end. We do not need to configure the DocumentGroup (though we do need to create it), the Documents, the Records or the terms, but we will need configuration for the other objects that we've seen so far.
Configuration for objects in Cheshire3 is done in XML. There is one schema that all objects use, but it is extended for ones with additional requirements beyond typical aspects such as settings, defaults, paths and identifiers.
Every object has an identifier which is unique within the context of where it is defined. This is important to understand, as it means that you can have two objects in different databases with the same identifier, but different configurations. It also means you can have one object at the Server level and one at a database level with the same identifier -- in this case the database will use its own object first.
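The lookup order described here, database scope first and then server scope, behaves much like a chain of dictionaries. The sketch below models it with the standard library's `collections.ChainMap`; the identifiers and configuration strings are made up for the example, and this is not how Cheshire3 itself implements the lookup.

```python
from collections import ChainMap

# Hypothetical objects defined at the server level.
server_objects = {
    "defaultParser": "server-level parser configuration",
    "defaultExtracter": "server-level extracter configuration",
}

# Hypothetical objects defined inside one database's configuration.
database_objects = {
    "defaultParser": "database-level parser configuration",
}

# Earlier mappings win, so the database's own definitions shadow the server's.
scope = ChainMap(database_objects, server_objects)

parser = scope["defaultParser"]        # found at the database level
extracter = scope["defaultExtracter"]  # falls back to the server level
```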
Configurations are nested for this scoping reason. The configuration for a server will contain several database configurations, which in turn will contain the indexes and storage objects required. However it is normal to have each database's configuration in a separate file in the directory tree along with any other associated files, rather than directly in the server configuration itself. The file is then imported via an instruction that we'll get to at the end.
The easiest way to configure a database is to copy and modify an existing configuration file (at least one is supplied with Cheshire3). For detailed information on the configuration file format, see the Configuration section.
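Objects to be configured (those introduced above, other than the DocumentGroup, Documents, Records and terms): PreParsers, Parsers, RecordStores, DocumentStores, Databases, Indexes, Extracters, Normalisers and IndexStores.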
Once everything is configured, we need to be able to tell the system to build it. There are two ways of doing this. Either we can configure another object to tell the system which PreParsers to use for this particular database, or we can write some Python code to call the objects directly. Examples of both are supplied in the distribution and explained in the pages below.