Build Script
A sample, fairly straightforward script that builds a database from a single file containing XML documents. We go through it section by section and explain how things work. Stylistically the Python code could be improved, but it is easy to understand. It can be used as a template for other scripts, or as a starting point for more complicated versions.
    01 #!/home/cheshire/install/python
    02
    03 import sys
    04
    05 osp = sys.path
    06 sys.path = ["/home/cheshire/cheshire3/code"]
    07 sys.path.extend(osp)
    08
    09 from baseObjects import Session
    10 from server import SimpleServer
    11 from documentGroup import BigFileDocumentGroup
The first thing to do in any script is to set up Python so that you can use the various Cheshire3 objects. Lines 05 to 07 put the Cheshire3 code directory at the front of the module search path, so that Cheshire3 modules are found before any similarly named modules that might be installed. Lines 09 to 11 import the three Cheshire3 classes that we use directly.
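The same effect as lines 05 to 07 can be had with `sys.path.insert`. The following is a minimal sketch, not part of the original script: the helper name `prepend_to_path` is our own, and the directory is simply the one used in the numbered listing, so adjust it to your own installation.

```python
import sys


def prepend_to_path(directory):
    """Put `directory` at the front of sys.path so its modules win name clashes."""
    # Drop an existing entry first so the directory is not listed twice.
    if directory in sys.path:
        sys.path.remove(directory)
    sys.path.insert(0, directory)


# Assumed install location, copied from the numbered listing.
prepend_to_path("/home/cheshire/cheshire3/code")
```

Because Python searches `sys.path` in order, anything importable from the first entry shadows modules of the same name further down the list.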
    13 # Build environment...
    14 session = Session()
    15 serv = SimpleServer(session, "../configs/serverConfig.xml")
    16 db = serv.get_object(session, 'db_tei')
    17 recStore = db.get_object(session, 'TeiRecordStore')
    18 sax = db.get_object(session, 'TeiParser')
    19
    20 dg = BigFileDocumentGroup("tei_files.xml")
    21 total = dg.get_length(session)
    22
Next we need to set up the Cheshire3 environment that has been configured. After creating a session (line 14), the server is built (line 15) by giving it the path to a configuration file. From there, the other objects we need, namely the database, the recordStore and the parser (lines 16-18), are retrieved by their identifiers.
In order to store and index records, we need to have them in a processable form. Line 20 creates a DocumentGroup from a file named 'tei_files.xml', and line 21 asks it how many documents the file contains, storing the count in 'total'.
    23 db.begin_indexing(session)
    24 recStore.begin_storing(session)
    25 for d in range(total):
    26     doc = dg.get_document(session, d)
    27     try:
    28         rec = sax.process_document(session, doc)
    29     except:
    30         print doc.get_raw()
    31         sys.exit()
    32     id = recStore.create_record(session, rec)
    33     db.add_record(session, rec)
    34     db.index_record(session, rec)
First (line 23) we need to tell the database that we're going to be indexing a lot of information. This lets the system handle all of the loading in one go at the end (line 37) and store only temporary information until then. Likewise, line 24 tells the record store that a lot of information is about to come in; the store is closed again at line 35.
Then we step through each record (25), extracting it from the documentGroup (26). Parsing (28) the record from the raw XML should always happen in a try: block (27) so that if the XML isn't well formed, you can do something sensible with it. The 'sensible' thing in this case is to print the raw document to the screen and then exit the script (30-31).
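Exiting on the first bad document is only one 'sensible' choice. As an illustration of another, the sketch below reports each document that is not well formed and carries on. It uses Python's standard `xml.sax` module as a stand-in for the Cheshire3 parser, and the function name `find_malformed` and the sample documents are invented for the example.

```python
import sys
import xml.sax


def find_malformed(raw_documents):
    """Try to parse each document; return the positions of those that fail."""
    failed = []
    for position, raw in enumerate(raw_documents):
        try:
            # A bare ContentHandler ignores the content; we only care
            # whether parsing gets through without an error.
            xml.sax.parseString(raw, xml.sax.ContentHandler())
        except xml.sax.SAXParseException as err:
            sys.stderr.write("document %d skipped: %s\n" % (position, err))
            failed.append(position)
    return failed


# Hypothetical input: the second document has a mismatched closing tag.
docs = [b"<doc><title>ok</title></doc>", b"<doc><title>broken</doc>"]
bad = find_malformed(docs)
```

Catching the parser's specific exception, rather than using a bare `except:`, means that unrelated problems such as a keyboard interrupt are not mistaken for bad XML.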
Once we have a record, we need to store it in the recordStore (line 32). Then we add it to the database (33) [recall that records may be in more than one database] and then index it (34).
    35 recStore.commit_storing(session)
    36 db.commit_metadata(session)
    37 db.commit_indexing(session)
Because we're not going to add any more records, we can close the recordStore (line 35). This ensures that any records are flushed to disk, rather than being kept in memory. We also need to commit the metadata about the database (such as the newly added records) to disk (line 36), and then finally we commit the indexing (line 37).
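The begin/commit pairing can be pictured with a toy class. This is only a sketch of the general idea of buffering work until a commit; it is not how Cheshire3 implements its stores, and the class `BufferedStore` and its attributes are hypothetical.

```python
class BufferedStore(object):
    """Toy store that keeps records in memory between begin and commit."""

    def __init__(self):
        self.pending = []    # records waiting to be written
        self.committed = []  # stands in for what has reached disk
        self.is_open = False

    def begin_storing(self):
        self.is_open = True

    def create_record(self, record):
        if not self.is_open:
            raise RuntimeError("call begin_storing() first")
        self.pending.append(record)
        # The identifier is the record's position among all records so far.
        return len(self.committed) + len(self.pending) - 1

    def commit_storing(self):
        # Write everything in one go, then forget the temporary copies.
        self.committed.extend(self.pending)
        self.pending = []
        self.is_open = False


store = BufferedStore()
store.begin_storing()
first_id = store.create_record("record A")
second_id = store.create_record("record B")
before_commit = list(store.committed)  # nothing has been written yet
store.commit_storing()
```

Until `commit_storing` is called, nothing in this toy has been 'written', which is why a script that forgets to commit, or exits early, can leave a store without its new records.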
The complete script, without line numbers:

    #!/home/cheshire/install/bin/python

    import sys

    # Add cheshire to our system path
    osp = sys.path
    sys.path = ["/home/cheshire/c3/cheshire3/code"]
    sys.path.extend(osp)

    from baseObjects import Session
    from server import SimpleServer
    from documentGroup import BigFileDocumentGroup

    # Build environment...
    session = Session()
    serv = SimpleServer(session, "../configs/serverConfig.xml")
    db = serv.get_object(session, 'db_tei')
    recStore = db.get_object(session, 'TeiRecordStore')
    sax = db.get_object(session, 'TeiParser')

    dg = BigFileDocumentGroup("tei_files.xml")
    total = dg.get_length(session)

    # Tell the store and the database that a bulk load is starting
    recStore.begin_storing(session)
    db.begin_indexing(session)

    for d in range(total):
        doc = dg.get_document(session, d)
        try:
            # Parse the raw XML into a record
            rec = sax.process_document(session, doc)
        except:
            # Not well formed: show the offending document and stop
            print doc.get_raw()
            sys.exit()
        id = recStore.create_record(session, rec)
        db.add_record(session, rec)
        db.index_record(session, rec)

    # Flush records, database metadata and indexes to disk
    recStore.commit_storing(session)
    db.commit_metadata(session)
    db.commit_indexing(session)