Cheshire3 Configuration: Cluster Extraction

Introduction

A cluster record consists of one controlled vocabulary field (the key) together with the data from other fields of every record that contains that key value. Clusters are generally used to support relevance-ranked keyword searches over description or scope/content fields: rather than retrieving the records that match, the search finds the most appropriate subject to use in a subsequent search.

For example, given the two Dublin Core records:

<record>
  <dc:description>this is a document about headache pain</dc:description>
  <dc:subject>Orofacial Pain</dc:subject>
</record>

<record>
  <dc:description>Drugs and their uses in modern headache treatment</dc:description>
  <dc:subject>Orofacial Pain</dc:subject>
</record>

From these, a ClusterDocumentGroup might create the following cluster record:

<cluster>
  <key>orofacial pain</key>
  <description> this is a document about headache pain 
                 Drugs and their uses in modern headache treatment
  </description>
</cluster>

A search for 'headache drugs pain' would thus return the subject 'Orofacial Pain'.
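The grouping idea above can be sketched in a few lines of Python. This is only an illustration of the concept, not Cheshire3 code: the dictionaries stand in for parsed records, the third record is hypothetical and added purely for contrast, and the scoring is a crude term overlap rather than Cheshire3's relevance ranking.

```python
from collections import defaultdict

# Toy stand-ins for parsed records (not Cheshire3 objects).
records = [
    {"subject": "Orofacial Pain",
     "description": "this is a document about headache pain"},
    {"subject": "Orofacial Pain",
     "description": "Drugs and their uses in modern headache treatment"},
    # Hypothetical extra record, not from the example above.
    {"subject": "Dental Implants",
     "description": "surgical placement of titanium implants"},
]


def build_clusters(records):
    """Pool the description text of all records under a lower-cased subject key."""
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec["subject"].lower()].append(rec["description"])
    return dict(clusters)


def best_subject(clusters, query):
    """Return the key whose pooled text shares the most words with the query."""
    terms = set(query.lower().split())

    def overlap(key):
        words = set(" ".join(clusters[key]).lower().split())
        return len(terms & words)

    return max(clusters, key=overlap)


clusters = build_clusters(records)
print(best_subject(clusters, "headache drugs pain"))
```

Because both matching descriptions are pooled under one key, a query that mixes words from different records still resolves to the shared subject.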

Example

<subConfig type="transformer" id="clusterExtractionTxr">
  <objectType>transformer.ClusterExtractionTransformer</objectType>
  <paths>
    <path type="tempPath">tempCluster.data</path>
    <object type="database" ref="db_cluster"/>
  </paths>
  <cluster>
    <map type="key">
      <xpath>datafield[@tag='640']</xpath>
      <xpath>key</xpath>
      <process>
        <object type="extracter" ref="ExactExtracter"/>
        <object type="normaliser" ref="CaseNormaliser"/>
      </process>
    </map>
    <map>
      <xpath>datafield[@tag='245']</xpath>
      <xpath>title</xpath>
      <process>
        <object type="extracter" ref="ExactExtracter"/>
        <object type="normaliser" ref="CaseNormaliser"/>
      </process>
    </map>
    <map>
      <xpath>datafield[@tag='500']</xpath>
      <xpath>description</xpath>
      <process>
        <object type="extracter" ref="ExactExtracter"/>
        <object type="normaliser" ref="CaseNormaliser"/>
      </process>
    </map>
  </cluster>
</subConfig>

<cluster>

A wrapper element around the mappings to be done.

<map>

Each cluster extracter must have at least two maps. A map contains two XPaths and a process chain (as per indexing) and configures one particular extraction operation.

<xpath>

The first xpath says where to find the data to extract in the record; the second says where to put it in the cluster file. The first map should be the key to use (the subject field in the example, marked with type="key") and must have 'key' as the name of the element to create.
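As a rough illustration of the source-to-target pairing, the following Python sketch applies three (source xpath, target element) pairs to a single record using the standard library's ElementTree. It is not how Cheshire3 implements the transformer: the record content is invented, `extract_cluster` is a hypothetical helper, whitespace stripping and lower-casing stand in for the process chain, and the merging of values across records that share a key is left out.

```python
import xml.etree.ElementTree as ET

# Hypothetical MARC-like record; the field contents are invented for illustration.
RECORD = """<record>
  <datafield tag="640">Orofacial Pain</datafield>
  <datafield tag="245">Headache Treatment</datafield>
  <datafield tag="500">Drugs and their uses</datafield>
</record>"""

# (source xpath, target element) pairs mirroring the three maps in the example.
MAPS = [
    ("datafield[@tag='640']", "key"),
    ("datafield[@tag='245']", "title"),
    ("datafield[@tag='500']", "description"),
]


def extract_cluster(record_xml, maps):
    """Copy the text found at each source path into the named target element."""
    record = ET.fromstring(record_xml)
    cluster = ET.Element("cluster")
    for source, target in maps:
        for node in record.findall(source):
            # Stand-in for the process chain: take the whole value, lower-case it.
            text = "".join(node.itertext()).strip().lower()
            ET.SubElement(cluster, target).text = text
    return cluster


cluster = extract_cluster(RECORD, MAPS)
print(ET.tostring(cluster, encoding="unicode"))
```

The first pair produces the `key` element, matching the requirement that the first map names 'key' as its target.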

<process>

This is the same as the process chain for indexing, except that typically only extracters and normalisers are used. For more information, see index configuration.
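Conceptually, a process chain passes the output of each step to the next. The sketch below shows that idea with plain Python functions. The names `exact_extracter`, `case_normaliser` and `run_chain` are hypothetical stand-ins chosen to echo the object references in the example, and they assume far simpler inputs and outputs than the real Cheshire3 extracter and normaliser objects.

```python
def exact_extracter(text):
    """Stand-in extracter: treat the whole field value as a single term."""
    return [text.strip()]


def case_normaliser(terms):
    """Stand-in normaliser: lower-case every term."""
    return [term.lower() for term in terms]


def run_chain(value, chain):
    """Feed the value through each step in order, as a process chain does."""
    for step in chain:
        value = step(value)
    return value


terms = run_chain("  Orofacial Pain ", [exact_extracter, case_normaliser])
print(terms)
```

Keeping each step a separate unit is what lets the same chain description be shared between indexing and cluster extraction.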