Extracter Configuration

Introduction

Extracters locate and extract data of a given format from either a string, a DOM node tree, or a list of SAX events. They must be the first object in an index's workflow. Normalisers are then used to process those terms into a standard form for storing in an index store.

Unless you are using a new extracter or normaliser class, the default server configuration should already build all of these objects. The configuration is described below for completeness.

Example

Example extracter and normaliser configurations:

<subConfig type="extracter" id="ExactExtracter">
  <objectType>extracter.SimpleExtracter</objectType>
</subConfig>

<subConfig type="normaliser" id="CaseNormaliser">
  <objectType>normaliser.CaseNormaliser</objectType>
</subConfig>

Explanation

These configurations need little explanation: each object does one thing and has few options or paths to set.

Currently available extracters, of which the first four are the most commonly used:

ExactExtracter: Extract the data exactly as it appears, but without any XML tags.
KeywordExtracter: Extract keywords from the data.
ProximityExtracter: Extract keywords from the data, maintaining their relative location. Must be used with a ProximityIndex.
DateExtracter: Extract a single date from the data. (A future version will extract multiple dates.)
ExactProximityExtracter: Extract the data exactly, but with proximity maintained for the element, rather than between words in the data.
ParentProximityExtracter: Extract the data as keywords, but maintain the proximity information relative to the parent element.
ExactParentProximityExtracter: Extract the data exactly, maintaining the proximity information relative to the parent element.
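
To make the difference between the first three extracters concrete, here is a minimal standalone Python sketch. It is not Cheshire3's implementation and does not use its API: the function names, the regular-expression tag stripping, and the whitespace tokenisation are all assumptions made for illustration.

```python
import re

# Hypothetical helper: a crude tag stripper, good enough for this illustration.
TAG = re.compile(r"<[^>]+>")


def exact_extract(data: str) -> str:
    """Return the text exactly as it appears, minus any XML tags."""
    return TAG.sub("", data)


def keyword_extract(data: str) -> list[str]:
    """Return individual keywords (assumption: split on whitespace)."""
    return exact_extract(data).split()


def proximity_extract(data: str) -> list[tuple[str, int]]:
    """Return keywords paired with their relative position in the data."""
    return [(word, pos) for pos, word in enumerate(keyword_extract(data))]


sample = "<title>The <i>quick</i> brown fox</title>"
print(exact_extract(sample))      # one exact string
print(keyword_extract(sample))    # a list of separate words
print(proximity_extract(sample))  # words with their positions
```

The exact form keeps the whole string as one term, the keyword form yields one term per word, and the proximity form additionally records where each word occurred, which is the information a ProximityIndex needs.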

Currently available normalisers:

CaseNormaliser: Convert the term to lower case
PossessiveNormaliser: Remove trailing possessive from the term (eg squirrel's -> squirrel, princesses' -> princesses)
ArticleNormaliser: Remove leading definite or indefinite article (the fish -> fish)
PrintableNormaliser: Remove any non-printable characters
StripperNormaliser: Remove printable punctuation characters: " % # @ ~ ! * { }
StoplistNormaliser: Remove words that appear in a stoplist. The stoplist file is given in a path of type 'stoplist' (<path type="stoplist">stoplist.txt</path>) and should have one word per line.
DateStringNormaliser: Convert a Date object extracted by DateExtracter into an ISO8601 formatted string
DiacriticNormaliser: Remove all diacritics from characters. (eg é -> e)
IntNormaliser: Convert a string into an integer (eg '2' -> 2)
StringIntNormaliser: Convert an integer into a 0 padded string (eg 2 -> '000000000002')
EnglishStemNormaliser: Convert an English word into a stemmed form, according to the Porter2 algorithm. (eg Fairy -> fairi) The possessive normaliser must be run before this normaliser.
KeywordNormaliser: Convert an exact extracted string into keywords.
ProximityNormaliser: Convert an exact extracted string into keywords maintaining proximity information.
ExactExpansionNormaliser: Sample implementation of an acronym and contraction expanding normaliser. Eg 'XML' -> 'Extensible Markup Language'
WordExpansionNormaliser: Sample implementation of an acronym expander when dealing with words rather than exact strings. Eg 'XML' -> 'Extensible', 'Markup', 'Language'
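
The behaviour of a few of these normalisers, and the way they are applied one after another to an extracted term, can be sketched in standalone Python. This is an illustration only, under stated assumptions: none of these functions are Cheshire3's classes, the names are hypothetical, and the 12-character padding width is simply inferred from the example given for StringIntNormaliser.

```python
import unicodedata


def case_normalise(term: str) -> str:
    """Convert the term to lower case."""
    return term.lower()


def possessive_normalise(term: str) -> str:
    """Remove a trailing possessive ('s or a bare apostrophe)."""
    if term.endswith("'s"):
        return term[:-2]
    if term.endswith("'"):
        return term[:-1]
    return term


def article_normalise(term: str) -> str:
    """Remove a leading definite or indefinite article."""
    for article in ("the ", "an ", "a "):
        if term.lower().startswith(article):
            return term[len(article):]
    return term


def diacritic_normalise(term: str) -> str:
    """Remove diacritics by decomposing and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def string_int_normalise(value: int, width: int = 12) -> str:
    """Convert an integer into a zero-padded string (width is an assumption)."""
    return str(value).zfill(width)


def run_chain(term, normalisers):
    """Apply each normaliser in order, feeding each result into the next."""
    for normalise in normalisers:
        term = normalise(term)
    return term


chain = [article_normalise, possessive_normalise, case_normalise]
print(run_chain("The Squirrel's", chain))
print(diacritic_normalise("café"))
print(string_int_normalise(2))
```

Because each step receives the output of the previous one, order matters. This is why, for example, the possessive normaliser has to run before the stemmer.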
