Build Script

Introduction

A sample, fairly straightforward script that builds a database from a single file containing XML documents. We go through it section by section and explain how each part works. Stylistically the Python code could be improved slightly, but it is easy to understand. It can be used as a template for other scripts, or as a starting point for more complicated versions.

Python Environment (01-11)

01 #!/home/cheshire/install/bin/python
02
03 import sys
04 
05 osp = sys.path
06 sys.path = ["/home/cheshire/cheshire3/code"]
07 sys.path.extend(osp)
08
09 from baseObjects import Session
10 from server import SimpleServer
11 from documentGroup import BigFileDocumentGroup

The first thing to do in any script is to set up Python so that it can use the various Cheshire3 objects. Lines 05-07 put the Cheshire3 code directory at the front of the module search path, so that Python finds the Cheshire3 code first, before any other similarly named modules that might be installed. Lines 09-11 import the three Cheshire3 classes that we use directly.
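The path handling on lines 05-07 can also be expressed as a small helper. This is only a sketch: the function name prefer_path is hypothetical, and unlike the original lines it also drops a duplicate entry if the directory was already on the path.

```python
import sys

def prefer_path(directory, search_path):
    """Return a copy of search_path with directory moved to the front."""
    # Drop any existing entry so the directory is not searched twice.
    rest = [p for p in search_path if p != directory]
    return [directory] + rest

# Roughly the same effect as lines 05-07: the Cheshire3 code directory
# is searched before anything else on the path.
sys.path = prefer_path("/home/cheshire/cheshire3/code", sys.path)
```

The directory does not need to exist for this to run; Python simply skips missing path entries when importing.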

Cheshire3 Environment (13-22)

13 # Build environment...
14 session = Session()
15 serv = SimpleServer(session, "../configs/serverConfig.xml")
16 db = serv.get_object(session, 'db_tei')
17 recStore = db.get_object(session, 'TeiRecordStore')
18 sax = db.get_object(session, 'TeiParser')
19 
20 dg = BigFileDocumentGroup("tei_files.xml")
21 total = dg.get_length(session)
22

Next we need to build the Cheshire3 environment from its configuration. The server is built (line 15) by giving it the path to a configuration file. From there, the other objects we need, namely the database, the recordStore and the parser (16-18), are retrieved by their identifiers.

In order to store and index records, we need to have them in a processable form. Line 20 creates a DocumentGroup from a file named 'tei_files.xml', and line 21 asks it how many documents the file contains, storing the count in 'total'.
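To make the idea of one file holding many documents concrete, here is a minimal sketch of one way such a file could be split. It is not how BigFileDocumentGroup is implemented, which is not shown here. It assumes every document has the same root element (TEI is used as an illustrative tag name) and that this element is never nested inside itself. The function name split_documents is hypothetical.

```python
import re

def split_documents(text, root_tag):
    """Return the raw XML string of each <root_tag>...</root_tag> document in text."""
    # Non-greedy match from an opening root tag to the nearest closing tag.
    # Assumes root_tag elements are never nested inside each other.
    pattern = re.compile(
        r"<%s[\s>].*?</%s\s*>" % (re.escape(root_tag), re.escape(root_tag)),
        re.DOTALL,
    )
    return pattern.findall(text)

sample = "<TEI><text>one</text></TEI>\n<TEI id='b'><text>two</text></TEI>\n"
docs = split_documents(sample, "TEI")
total = len(docs)   # plays the role of dg.get_length(session)
first = docs[0]     # plays the role of dg.get_document(session, 0)
```

A regular expression is enough for this illustration, but it reads the whole file into memory and would break on nested root elements, so it should not be taken as a general solution.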

Load and Index (23-34)

23 db.begin_indexing(session)
24 recStore.begin_storing(session)
25 for d in range(total):
26     doc = dg.get_document(session, d)
27     try:
28         rec = sax.process_document(session, doc)
29     except:
30         print doc.get_raw()
31         sys.exit()
32     id = recStore.create_record(session, rec)
33     db.add_record(session, rec)
34     db.index_record(session, rec)

First (line 23) we need to tell the database that we're going to be indexing a lot of information. This lets the system handle all of the loading in one go at the end (line 37) and store only temporary information until then. Likewise, line 24 tells the record store that it's going to be getting a lot of information coming in; it is closed at line 35.

Then we step through each document (25), extracting it from the documentGroup (26). Parsing (28) the record from the raw XML should always happen in a try: (27) block so that if the XML isn't well formed, you can do something sensible with it. The 'sensible' thing in this case is to print it to the screen and then exit the script (30-31).
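The same defensive pattern can be tried without a Cheshire3 installation. The sketch below uses Python's standard xml.sax module as a stand-in for the Cheshire3 parser, and catches the specific parse exception rather than using a bare except:. The function name parse_or_report is hypothetical.

```python
import xml.sax
from xml.sax.handler import ContentHandler

def parse_or_report(raw_xml):
    """Return True if raw_xml is well formed, else print it and return False."""
    try:
        # parseString raises SAXParseException on malformed input.
        xml.sax.parseString(raw_xml.encode("utf-8"), ContentHandler())
    except xml.sax.SAXParseException as err:
        print("Could not parse document: %s" % err)
        print(raw_xml)
        return False
    return True

good = parse_or_report("<TEI><text>ok</text></TEI>")
bad = parse_or_report("<TEI><text>unclosed</TEI>")
```

Returning a flag instead of exiting leaves the decision to the caller, which could skip the bad document and carry on, or stop as the script above does.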

Once we have a record, we need to store it in the recordStore (line 32). Then we add it to the database (33) [recall that records may be in more than one database] and then index it (34).

Cleanup (35-37)

35 recStore.commit_storing(session)
36 db.commit_metadata(session)
37 db.commit_indexing(session)

Because we're not going to add any more records, we can close the recordStore (line 35). This ensures that any records are flushed to disk, rather than being kept in memory. We also need to commit the metadata about the database (such as the newly added records) to disk (line 36), and finally we commit the indexing (line 37).
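The begin/commit pairing can be illustrated with a toy store that buffers records until it is told to commit. This is only a sketch of the general pattern, not of how a Cheshire3 recordStore works internally. The class BufferedStore and its simplified method signatures (no session argument) are hypothetical.

```python
class BufferedStore(object):
    """Toy store that holds records in memory until commit is called."""

    def __init__(self):
        self.pending = []    # records received since begin_storing
        self.committed = []  # stands in for data flushed to disk

    def begin_storing(self):
        self.pending = []

    def create_record(self, rec):
        self.pending.append(rec)
        # The identifier is the record's position in the store.
        return len(self.committed) + len(self.pending) - 1

    def commit_storing(self):
        # Flush everything in one go, then clear the buffer.
        self.committed.extend(self.pending)
        self.pending = []

store = BufferedStore()
store.begin_storing()
try:
    ids = [store.create_record(r) for r in ["rec-a", "rec-b", "rec-c"]]
finally:
    # Runs even if a record fails, so earlier records are not lost.
    store.commit_storing()
```

Wrapping the loop in try/finally is a design choice that differs from the script above, which exits on the first bad document without committing; whether partial results should be kept depends on the application.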

Complete Example

#!/home/cheshire/install/bin/python
import sys

# Add cheshire to our system path
osp = sys.path
sys.path = ["/home/cheshire/cheshire3/code"]
sys.path.extend(osp)

from baseObjects import Session
from server import SimpleServer
from documentGroup import BigFileDocumentGroup

# Build environment...
session = Session()
serv = SimpleServer(session, "../configs/serverConfig.xml")
db = serv.get_object(session, 'db_tei')
recStore = db.get_object(session, 'TeiRecordStore')
sax = db.get_object(session, 'TeiParser')

dg = BigFileDocumentGroup("tei_files.xml")
total = dg.get_length(session)

db.begin_indexing(session)
recStore.begin_storing(session)
for d in range(total):
    doc = dg.get_document(session, d)
    try:
        rec = sax.process_document(session, doc)
    except:
        print doc.get_raw()
        sys.exit()
    id = recStore.create_record(session, rec)
    db.add_record(session, rec)
    db.index_record(session, rec)
recStore.commit_storing(session)
db.commit_metadata(session)
db.commit_indexing(session)