Extracter and Normaliser Configurations
Introduction
Extracters locate and extract data of a given format from a string, a DOM node tree, or a list of SAX events. An extracter must be the first object in an index's workflow. Normalisers then process the extracted terms into a standard form for storage in an index store.
The default server configuration should already build all of the standard extracters and normalisers, so you only need to configure one yourself when using a new extracter or normaliser class. For completeness, the configuration is described below.
Example
Example extracter and normaliser configurations:
<subConfig type="extracter" id="ExactExtracter">
  <objectType>extracter.SimpleExtracter</objectType>
</subConfig>

<subConfig type="normaliser" id="CaseNormaliser">
  <objectType>normaliser.CaseNormaliser</objectType>
</subConfig>
Explanation
These configurations are minimal: each object does a single job, and most have no options or paths to set. A type, an id and an objectType naming the class are usually all that is needed.
Currently available extracters, of which the first four are the most commonly used:
- ExactExtracter: Extract the data exactly as it appears, but without any XML tags.
- KeywordExtracter: Extract keywords from the data.
- ProximityExtracter: Extract keywords from the data, maintaining their relative location. Must be used with a ProximityIndex.
- DateExtracter: Extract a single date from the data. (A future version will extract multiple dates.)
- ExactProximityExtracter: Extract the data exactly, but with proximity maintained for the element, rather than between words in the data.
- ParentProximityExtracter: Extract the data as keywords, but maintain the proximity information relative to the parent element.
- ExactParentProximityExtracter: Extract the data exactly, maintaining the proximity information relative to the parent element.
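As a rough illustration of the rule that an extracter comes first in an index's workflow, the fragment below sketches how a ProximityExtracter might be referenced from a ProximityIndex. The index id, the XPath, the indexStore reference and the `index.ProximityIndex` class path are all assumptions made for this example, and the `<source>`/`<process>` layout is assumed from the usual index configuration pattern rather than taken from this section, so check it against your own server configuration.

```xml
<!-- Hypothetical index: the id, xpath and indexStore ref are placeholders -->
<subConfig type="index" id="idx-title-prox">
  <objectType>index.ProximityIndex</objectType>
  <paths>
    <object type="indexStore" ref="indexStore"/>
  </paths>
  <source>
    <xpath>title</xpath>
    <process>
      <!-- The extracter must be the first object in the workflow -->
      <object type="extracter" ref="ProximityExtracter"/>
      <!-- Normalisers then process the extracted terms, in order -->
      <object type="normaliser" ref="CaseNormaliser"/>
    </process>
  </source>
</subConfig>
```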
Currently available normalisers:
- CaseNormaliser: Convert the term to lower case
- PossessiveNormaliser: Remove trailing possessive from the term (eg squirrel's -> squirrel, princesses' -> princesses)
- ArticleNormaliser: Remove leading definite or indefinite article (the fish -> fish)
- PrintableNormaliser: Remove any non-printable characters
- StripperNormaliser: Remove printable punctuation characters: " % # @ ~ ! * { }
- StoplistNormaliser: Remove words that appear in a stoplist, supplied as a path of type 'stoplist' (<path type="stoplist">stoplist.txt</path>). The stoplist file should have one word per line.
- DateStringNormaliser: Convert a Date object extracted by DateExtracter into an ISO8601 formatted string
- DiacriticNormaliser: Remove all diacritics from characters. (eg é -> e)
- IntNormaliser: Convert a string into an integer (eg '2' -> 2)
- StringIntNormaliser: Convert an integer into a 0 padded string (eg 2 -> '000000000002')
- EnglishStemNormaliser: Convert an English word into a stemmed form, according to the Porter2 algorithm (eg Fairy -> fairi). The PossessiveNormaliser must be run before this normaliser.
- KeywordNormaliser: Convert an exact extracted string into keywords.
- ProximityNormaliser: Convert an exact extracted string into keywords maintaining proximity information.
- ExactExpansionNormaliser: Sample implementation of an acronym and contraction expanding normaliser. Eg 'XML' -> 'Extensible Markup Language'
- WordExpansionNormaliser: Sample implementation of an acronym expander when dealing with words rather than exact strings. Eg 'XML' -> 'Extensible', 'Markup', 'Language'
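StoplistNormaliser is the one normaliser listed above that needs a path, so a sketch of its configuration may be useful. Only the `<path type="stoplist">` line comes from this section: the `normaliser.StoplistNormaliser` class path is assumed from the naming pattern in the earlier example, and the enclosing `<paths>` element and the `<process>` layout are likewise assumptions. The second fragment shows one way the ordering requirement for EnglishStemNormaliser might look inside a hypothetical index workflow.

```xml
<!-- Assumed class path and <paths> wrapper; stoplist.txt holds one word per line -->
<subConfig type="normaliser" id="StoplistNormaliser">
  <objectType>normaliser.StoplistNormaliser</objectType>
  <paths>
    <path type="stoplist">stoplist.txt</path>
  </paths>
</subConfig>

<!-- Hypothetical workflow fragment: possessives are removed before stemming -->
<process>
  <object type="extracter" ref="KeywordExtracter"/>
  <object type="normaliser" ref="CaseNormaliser"/>
  <object type="normaliser" ref="PossessiveNormaliser"/>
  <object type="normaliser" ref="EnglishStemNormaliser"/>
</process>
```

Placing CaseNormaliser first is a choice made for this sketch, not a requirement stated here. The only ordering this section requires is that PossessiveNormaliser runs before EnglishStemNormaliser.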