Introduction
PreParsers convert documents, as they are introduced to the system, into the form in which they can be most easily processed.
Each PreParser typically does only one thing, so their configuration sections are short.
Example
Example preParser configurations:
<subConfig type="preParser" id="SgmlPreParser">
  <objectType>extracter.SgmlPreParser</objectType>
  <options>
    <setting type="emptyElements">lb ptr extptr hr</setting>
  </options>
</subConfig>

<subConfig type="preParser" id="CharacterEntityPreParser">
  <objectType>extracter.CharacterEntityPreParser</objectType>
</subConfig>
Explanation
There is little to explain, as these objects only do one thing and have few options or paths to set. The first example is one of the few that takes a setting: emptyElements is a space-separated list of empty SGML elements to be converted to empty XML elements (e.g. <hr> -> <hr/>).
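The effect of the emptyElements setting can be illustrated with a small standalone sketch. This is not the SgmlPreParser implementation; the function name close_empty_elements is hypothetical, and the sketch assumes element names already appear in the same case as in the setting.

```python
import re


def close_empty_elements(text, empty_elements):
    """Rewrite <name ...> as <name .../> for each listed element name.

    `empty_elements` is a space-separated string, in the same shape as
    the emptyElements setting (an assumption made for illustration).
    """
    names = [re.escape(n) for n in empty_elements.split()]
    if not names:
        return text
    # Group 1: element name. Group 2: optional attributes.
    # A trailing "/" is consumed if present, so tags that are already
    # self-closed are left equivalent rather than doubled up.
    pattern = re.compile(r"<(%s)((?:\s[^<>]*?)?)\s*/?>" % "|".join(names))
    return pattern.sub(r"<\1\2/>", text)


if __name__ == "__main__":
    sample = '<p>one<lb>two<ptr target="x1"><hr/></p>'
    print(close_empty_elements(sample, "lb ptr extptr hr"))
```

A regular expression is enough here only because empty elements have no content to balance; it does not attempt the other SGML fixes (lowercasing names, quoting attributes).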
Currently available PreParsers:
- SgmlPreParser: Convert SGML into XML (lowercase element names, quote all attributes, fix empty tags)
- PrintableOnlyPreParser: Remove any non-printable characters
- PdfToTxtPreParser: Convert PDF into raw text format using the pdftotext utility
- TxtToXmlPreParser: Wrap raw text in some simple XML tags
- CharacterEntityPreParser: Turn latin-1 entities into XML character entities (e.g. &ndash; -> &#8211;)
- MarcToXmlPreParser: Convert MARC records into MARCXML
- MarcToSgmlPreParser: Convert MARC records into MARCSGML (Cheshire2 format)
- MontyXmlPreParser: Tag parts of speech in raw text file as XML. (See additional licence in code/monty)
- MontyTxtPreParser: Tag parts of speech in raw text file as text.
- GzipPreParser: Uncompress a gzipped file
- BzipPreParser: Uncompress a bzipped file
- B64EncodePreParser, B64DecodePreParser: Encode or decode a file using Base64
- WordPreParser: Sample implementation. Using an OpenOffice server, turn any format that OOo recognises into OpenDocument XML.
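As an illustration of the kind of single-purpose transformation listed above, the following is a rough sketch of named-to-numeric entity conversion, the idea behind CharacterEntityPreParser. It is not the Cheshire3 implementation: the function name named_to_numeric is hypothetical, the entity table comes from Python's standard html.entities module, and the choice of decimal references is an assumption.

```python
import re
from html.entities import name2codepoint

# Entities that XML already defines; these are left untouched.
_XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}
_ENTITY_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")


def named_to_numeric(text):
    """Replace known named entities with decimal character references."""

    def _replace(match):
        name = match.group(1)
        if name in _XML_PREDEFINED or name not in name2codepoint:
            # Unknown or XML-native entity: keep the original text.
            return match.group(0)
        return "&#%d;" % name2codepoint[name]

    return _ENTITY_RE.sub(_replace, text)


if __name__ == "__main__":
    print(named_to_numeric("caf&eacute; pages 1&ndash;2 &amp; more"))
```

Leaving unknown entities in place, rather than dropping them, keeps the step lossless, so a later stage can still report or repair them.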