Workflow Configuration
Workflows are first-class objects in the Cheshire3 system: they are configured at the same time and in the same way as other objects. Their purpose is to make a series of processing steps easy to define and portable between systems, as an alternative to writing customised code to achieve the same result.
Build workflows are the most common type, because the data must generally pass through many different functions on different objects. As explained previously, however, the differences between databases are often confined to one section. With workflows, we can define only the changed section rather than writing code to do the same task over and over again.
The current disadvantage of workflows is that it is difficult to find out what went wrong when something fails. If your data is very clean, a workflow is probably the right solution. If the data is likely to contain XML parse errors, or has to pass through many different preParsers and you want to verify each step, hand-written code may be a better choice.
The distribution comes with a generic build workflow object called 'buildIndexWorkflow'. It calls 'buildIndexSingleWorkflow', also supplied, to handle each individual document. This second workflow in turn calls 'PreParserWorkflow'. A trivial one is supplied, but it is very unlikely to suit your particular needs and should be customised as required.
Simple workflow configuration:
01 <subConfig id="PreParserWorkflow">
02   <objectType>workflow.SimpleWorkflow</objectType>
03   <workflow>
04     <!-- input type: document -->
05     <object type="preParser" ref="SgmlPreParser"/>
06     <object type="preParser" ref="CharacterEntityPreParser"/>
07   </workflow>
08 </subConfig>
Slightly more complex workflow configuration:
01 <subConfig id="buildIndexWorkflow">
02   <objectType>workflow.SimpleWorkflow</objectType>
03   <workflow>
04     <!-- input type: documentGroup -->
05     <log>Loading records</log>
06     <object type="recordStore" function="begin_storing"/>
07     <object type="database" function="begin_indexing"/>
08     <for-each>
09       <object type="workflow" ref="buildIndexSingleWorkflow"/>
10     </for-each>
11     <object type="recordStore" function="commit_storing"/>
12     <object type="database" function="commit_metadata"/>
13     <object type="database" function="commit_indexing"/>
14   </workflow>
15 </subConfig>
The first two lines of each configuration take the same form as for all previous objects. The one new section is workflow, which contains a series of instructions for what to do, primarily a list of objects that handle the data in turn.
The first workflow is an example of how to override the PreParserWorkflow for a specific database. In this case we start by giving the document input object to the SgmlPreParser in line 5, and the result of that is given to the CharacterEntityPreParser in line 6. Note that line 4 is just a comment and not required.
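The chaining described here, where each preParser receives the previous one's output, can be modelled in a few lines of plain Python. This is only a sketch of the idea and not Cheshire3's implementation: the functions toy_sgml_pre_parser, toy_entity_pre_parser and run_chain are hypothetical names, and the string replacements they perform are placeholders rather than the behaviour of the real SgmlPreParser and CharacterEntityPreParser.

```python
from typing import Callable, Iterable

# Hypothetical stand-ins for the two preParsers named in the configuration.
# Their bodies are toy transformations chosen only to make the chaining visible.
def toy_sgml_pre_parser(document: str) -> str:
    """Close an unclosed <br> tag (placeholder behaviour)."""
    return document.replace("<br>", "<br/>")


def toy_entity_pre_parser(document: str) -> str:
    """Turn a named entity into a numeric one (placeholder behaviour)."""
    return document.replace("&nbsp;", "&#160;")


def run_chain(steps: Iterable[Callable[[str], str]], document: str) -> str:
    """Feed the document through each step, passing every result to the next."""
    for step in steps:
        document = step(document)
    return document


# Same order as lines 5 and 6 of the first workflow.
result = run_chain([toy_sgml_pre_parser, toy_entity_pre_parser],
                   "<p>a&nbsp;b<br></p>")
print(result)  # <p>a&#160;b<br/></p>
```

The order of the list matters in the same way as the order of the object elements in the workflow: swapping the two steps means the second transformation sees different input.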
The second example is slightly more complex with some additional constructions. Line 5 uses the log instruction to get the workflow to log the fact that it is starting to load records.
In lines 6 and 7 the object tags have a second attribute called 'function'. This contains the name of the function to call when it cannot be derived from the input object. For example, a preParser will always call 'process_document', but you need to specify the function to call on a database because many are available. Note also that there is no 'ref' attribute referencing a specific object identifier. In this case the workflow uses the current session to determine which server, database, recordStore and so forth should be used. This allows the same workflow to be used in multiple contexts.
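The two lookup rules in this paragraph (which object to use, and which function to call on it) can be illustrated with a small standalone model. Everything here is an assumption made for illustration: resolve_step, session_defaults, named_objects and the toy classes are hypothetical and do not reflect how Cheshire3 resolves objects internally. The only detail taken from the text is that a preParser defaults to 'process_document'.

```python
from typing import Any, Callable, Dict, Optional

# From the text: a preParser always calls 'process_document'.
# No other defaults are assumed here.
DEFAULT_FUNCTIONS: Dict[str, str] = {"preParser": "process_document"}


def resolve_step(session_defaults: Dict[str, Any],
                 named_objects: Dict[str, Any],
                 obj_type: str,
                 ref: Optional[str] = None,
                 function: Optional[str] = None) -> Callable[..., Any]:
    """Return the bound method that one <object/> step would call (toy model)."""
    # With a 'ref', use that specific object; without one, fall back to
    # whatever the current session provides for this object type.
    target = named_objects[ref] if ref is not None else session_defaults[obj_type]
    # With a 'function', call it; otherwise use the default for the type.
    name = function if function is not None else DEFAULT_FUNCTIONS.get(obj_type)
    if name is None:
        raise ValueError(f"no function given and no default for type {obj_type!r}")
    return getattr(target, name)


class ToyPreParser:
    def process_document(self, document: str) -> str:
        return document.strip()


class ToyDatabase:
    def __init__(self) -> None:
        self.indexing = False

    def begin_indexing(self) -> None:
        self.indexing = True


named = {"SgmlPreParser": ToyPreParser()}
database = ToyDatabase()
defaults = {"database": database}

# <object type="preParser" ref="SgmlPreParser"/>
step = resolve_step(defaults, named, "preParser", ref="SgmlPreParser")
# <object type="database" function="begin_indexing"/>
resolve_step(defaults, named, "database", function="begin_indexing")()
```

Because the database step carries no 'ref', running the same model with a different session_defaults mapping would act on a different database without any change to the step itself.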
The for-each block then iterates through the documents in the supplied documentGroup, calling buildIndexSingleWorkflow on each of them.
Finally, in lines 11 to 13, the recordStore and database have their commit functions called to ensure that everything is written out to disk.
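For comparison with the hand-written alternative mentioned earlier, the sequence of steps in the second workflow can be written out as an ordinary Python function. This is a simplified sketch under stated assumptions: the Recorder class, build_index and its parameters are hypothetical, the toy methods take no arguments, and the real Cheshire3 objects and their call signatures are not reproduced here. Only the function names and their order are taken from the configuration.

```python
from typing import Callable, Iterable, List


class Recorder:
    """Toy object that records the name of every method called on it."""

    def __init__(self, name: str, calls: List[str]) -> None:
        self._name = name
        self._calls = calls

    def __getattr__(self, function: str) -> Callable[..., None]:
        # Only reached for attributes that do not exist, i.e. the
        # workflow functions; each call is appended to the shared list.
        def call(*args: object) -> None:
            self._calls.append(f"{self._name}.{function}")
        return call


def build_index(document_group: Iterable[str],
                record_store: Recorder,
                database: Recorder,
                single_workflow: Callable[[str], None],
                log: Callable[[str], None]) -> None:
    """Mirror lines 5 to 13 of the buildIndexWorkflow configuration."""
    log("Loading records")               # line 5
    record_store.begin_storing()         # line 6
    database.begin_indexing()            # line 7
    for document in document_group:      # lines 8 to 10
        single_workflow(document)
    record_store.commit_storing()        # line 11
    database.commit_metadata()           # line 12
    database.commit_indexing()           # line 13


calls: List[str] = []
build_index(["d1", "d2"],
            Recorder("recordStore", calls),
            Recorder("database", calls),
            lambda doc: calls.append(f"single({doc})"),
            lambda msg: calls.append(f"log:{msg}"))
```

Writing the steps out this way makes it possible to inspect or wrap each call individually, which is the verification advantage of hand-written code noted above, at the cost of repeating the sequence for every database.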