Index Configuration
Indexes are the primary means of locating records in the system, and hence need to be well thought out and specified in advance. An index consists of one or more paths to tags in the record, together with instructions for how to process the data once it has been located.
Example index configurations:
01 <subConfig id="xtitle-idx">
02   <objectType>index.SimpleIndex</objectType>
03   <paths>
04     <object type="indexStore" ref="eadIndexStore"/>
05   </paths>
06   <source>
07     <xpath>/ead/eadheader/filedesc/titlestmt/titleproper</xpath>
08     <process>
09       <object type="extracter" ref="ExactExtracter"/>
10       <object type="normaliser" ref="CaseNormaliser"/>
11     </process>
12   </source>
13   <options>
14     <setting type="sortStore">true</setting>
15   </options>
16 </subConfig>
17
18 <subConfig id="stemtitleword-idx">
19   <objectType>index.ProximityIndex</objectType>
20   <paths>
21     <object type="indexStore" ref="eadIndexStore"/>
22   </paths>
23   <source>
24     <xpath>titleproper</xpath>
25     <process>
26       <object type="extracter" ref="ProximityExtracter"/>
27       <object type="normaliser" ref="CaseNormaliser"/>
28       <object type="normaliser" ref="PossessiveNormaliser"/>
29       <object type="normaliser" ref="EnglishStemNormaliser"/>
30     </process>
31   </source>
32 </subConfig>
Lines 1 and 2, and likewise lines 18 and 19, should be second nature by now. Lines 4 and 21 each hold a reference to the indexStore in which the index will be maintained.
This brings us to the source section starting in line 6. It must contain one or more xpath elements. These XPaths will be evaluated against the record to find a node, nodeSet or attribute value. This is the base data that will be indexed after some processing. In the first case we give the full path, but in the second only the final element name. In Cheshire3, it is generally most efficient to give the shortest path that still identifies exactly the elements you want to index, so the path at line 24 is cheaper than the path at line 7, as long as it still selects only the elements you intend.
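As a sketch of a source section with more than one xpath element, the fragment below draws titles from two places in an EAD record. The index id and the second path (/ead/archdesc/did/unittitle) are assumptions chosen purely for illustration. The objectType, indexStore reference, extracter and normaliser are the ones already used in the first example.

```xml
<!-- Hypothetical index: the id "alltitle-idx" is an illustrative name -->
<subConfig id="alltitle-idx">
  <objectType>index.SimpleIndex</objectType>
  <paths>
    <object type="indexStore" ref="eadIndexStore"/>
  </paths>
  <source>
    <!-- Same full path as in the first example -->
    <xpath>/ead/eadheader/filedesc/titlestmt/titleproper</xpath>
    <!-- Assumed second location for a title; adjust to your own records -->
    <xpath>/ead/archdesc/did/unittitle</xpath>
    <!-- A single process chain, listed after the xpath elements -->
    <process>
      <object type="extracter" ref="ExactExtracter"/>
      <object type="normaliser" ref="CaseNormaliser"/>
    </process>
  </source>
</subConfig>
```

The intent is that data located by either path is passed through the one process chain that follows; check this against your own installation before relying on it.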
If the records contain XML Namespaces, then there are two approaches available. If the element names are unique across all the namespaces in the document, you can simply omit the prefixes. For example, /srw:record/dc:title could be written as just /record/title. The alternative is to declare the 'srw' and 'dc' prefixes on the xpath element itself, using xmlns attributes in the usual way.
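The second approach might look like the fragment below, offered as a sketch rather than a tested configuration. The namespace URIs are the ones conventionally bound to the srw and dc prefixes (SRW and Dublin Core); they are an assumption here, so substitute whatever URIs your records actually declare. The process chain is borrowed from the examples above.

```xml
<source>
  <!-- Prefixes are declared on the xpath element itself; URIs are assumed -->
  <xpath xmlns:srw="http://www.loc.gov/zing/srw/"
         xmlns:dc="http://purl.org/dc/elements/1.1/">/srw:record/dc:title</xpath>
  <process>
    <object type="extracter" ref="ProximityExtracter"/>
    <object type="normaliser" ref="CaseNormaliser"/>
  </process>
</source>
```

Declaring the prefixes is the safer choice when the same element name occurs in more than one namespace, since omitting them relies on the names being unique.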
After the XPath(s), we need to tell the system how to process the data that gets pulled out. This happens in the process section, which is a list of objects through which the data is fed in sequence. The first object must be an extracter, and generally the others are normalisers (though there are times when you might want to put in other types of processing object as well). The first index uses the ExactExtracter to pull out the text exactly as it appears, as a single term. The second uses ProximityExtracter to pull out the keywords along with their positions in the field, which allows for phrase searching.
Both indexes then send the extracted terms to a CaseNormaliser, which will reduce all characters to lowercase. The second index then gives the lowercase terms to a PossessiveNormaliser to strip off 's and s' from the end, and then to EnglishStemNormaliser to apply linguistic stemming.
After these processes have happened, the system will store the transformed terms in the indexStore referenced in the paths section.
Finally, in the first example, we have a setting called 'sortStore'. If this is given, then the system will create a map from record to term for the index, so that a record's term can be retrieved quickly for the purposes of sorting.