strusAnalyzer  0.17
strus::TextProcessorInterface Class Reference (abstract)

Interface for the object providing tokenizers and normalizers used for creating terms from segments of text and functions for collecting overall document statistics.

#include <textProcessorInterface.hpp>

Public Types

enum  FunctionType {
  Segmenter, TokenizerFunction, NormalizerFunction, AggregatorFunction,
  PatternLexer, PatternMatcher
}
 Function type for fetching descriptions of available functions.
 

Public Member Functions

virtual ~TextProcessorInterface ()
 Destructor.

virtual std::string getResourceFilePath (const std::string &filename) const =0
 Get the absolute path of a resource file.

virtual const SegmenterInterface * getSegmenterByName (const std::string &segmenterName) const =0
 Get a document segmenter object reference.

virtual const SegmenterInterface * getSegmenterByMimeType (const std::string &mimetype) const =0
 Get a document segmenter object reference that is able to process the specified MIME type.

virtual analyzer::SegmenterOptions getSegmenterOptions (const std::string &scheme) const =0
 Get the options for a document segmenter for a specific document type.
 
virtual const TokenizerFunctionInterface * getTokenizer (const std::string &name) const =0
 Get a const reference to a tokenizer object that implements the splitting of a text segment into tokens.

virtual const NormalizerFunctionInterface * getNormalizer (const std::string &name) const =0
 Get a const reference to a normalizer object that implements the transformation of a token into a term string.

virtual const AggregatorFunctionInterface * getAggregator (const std::string &name) const =0
 Get a const reference to a statistics collector function object that implements the collection of counts over document parts.

virtual const PatternLexerInterface * getPatternLexer (const std::string &name) const =0
 Get a const reference to a pattern lexer.

virtual const PatternMatcherInterface * getPatternMatcher (const std::string &name) const =0
 Get a const reference to a pattern matcher.

virtual const PatternTermFeederInterface * getPatternTermFeeder () const =0
 Get the default pattern term feeder interface for post processing pattern matching on analyzer output.

virtual PosTaggerDataInterface * createPosTaggerData (TokenizerFunctionInstanceInterface *tokenizer) const =0
 Create a data structure to feed with POS tagging info.

virtual const PosTaggerInterface * getPosTagger () const =0
 Get the default POS tagger interface to do POS tagging of documents.

virtual TokenMarkupInstanceInterface * createTokenMarkupInstance () const =0
 Create an interface for markup of content.
 
virtual bool detectDocumentClass (analyzer::DocumentClass &dclass, const char *contentBegin, std::size_t contentBeginSize, bool isComplete) const =0
 Detect the document class from a document start chunk and set the content description attributes.

virtual void defineDocumentClassDetector (DocumentClassDetectorInterface *detector)=0
 Define a content detector.

virtual void defineSegmenter (const std::string &name, SegmenterInterface *segmenter)=0
 Define a document segmenter by name.

virtual void defineSegmenterOptions (const std::string &scheme, const analyzer::SegmenterOptions &options)=0
 Define segmenter options by document scheme identifier.

virtual void defineTokenizer (const std::string &name, TokenizerFunctionInterface *tokenizer)=0
 Define a tokenizer by name.

virtual void defineNormalizer (const std::string &name, NormalizerFunctionInterface *normalizer)=0
 Define a normalizer by name.

virtual void defineAggregator (const std::string &name, AggregatorFunctionInterface *aggregator)=0
 Define an aggregator function by name.

virtual void definePatternLexer (const std::string &name, PatternLexerInterface *lexer)=0
 Define a pattern matching lexer by name.

virtual void definePatternMatcher (const std::string &name, PatternMatcherInterface *matcher)=0
 Define a pattern matcher by name.

virtual std::vector< std::string > getFunctionList (const FunctionType &type) const =0
 Get a list of all functions of a specific type available.
 

Static Public Member Functions

static const char * functionTypeName (FunctionType t)
 

Detailed Description

Interface for the object providing tokenizers and normalizers used for creating terms from segments of text and functions for collecting overall document statistics.

Member Enumeration Documentation

Function type for fetching descriptions of available functions.

Enumerator
Segmenter 

Addresses a document segmenter.

TokenizerFunction 

Addresses a tokenizer.

NormalizerFunction 

Addresses a normalizer.

AggregatorFunction 

Addresses an aggregator.

PatternLexer 

Addresses a pattern lexer.

PatternMatcher 

Addresses a pattern matcher.

Constructor & Destructor Documentation

virtual strus::TextProcessorInterface::~TextProcessorInterface ( )
inline virtual

Destructor.

Member Function Documentation

virtual PosTaggerDataInterface* strus::TextProcessorInterface::createPosTaggerData ( TokenizerFunctionInstanceInterface *  tokenizer) const
pure virtual

Create a data structure to feed with POS tagging info.

Parameters
[in] tokenizer: tokenizer to use to split POS tagging entities (with ownership)
Remarks
The tokenization must be at least as fine-grained as any split the POS tagger may produce. This means that the POS tagger used must not split tokens provided by the tokenizer.
Returns
the POS tagger data interface (with ownership)
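The granularity remark above can be checked mechanically. The following is a minimal sketch, assuming tokens and POS entities are both available as half-open byte ranges; the `Span` type and the function `posEntitiesRespectTokens` are hypothetical helpers for illustration and are not part of strus.

```cpp
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Hypothetical helper type: a half-open byte range [begin, end) in the text.
typedef std::pair<std::size_t, std::size_t> Span;

// Returns true if every POS tagger entity starts at the start of some token
// and ends at the end of some token, i.e. the POS tagger never splits a
// token produced by the tokenizer.
inline bool posEntitiesRespectTokens(const std::vector<Span>& tokens,
                                     const std::vector<Span>& posEntities) {
    std::set<std::size_t> starts;
    std::set<std::size_t> ends;
    for (const Span& t : tokens) {
        starts.insert(t.first);
        ends.insert(t.second);
    }
    for (const Span& e : posEntities) {
        if (!starts.count(e.first) || !ends.count(e.second)) {
            return false;  // entity boundary falls inside a token
        }
    }
    return true;
}
```

A POS entity such as "New York" spanning two tokens passes this check, while an entity that ends in the middle of a token does not.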
virtual TokenMarkupInstanceInterface* strus::TextProcessorInterface::createTokenMarkupInstance ( ) const
pure virtual

Create an interface for markup of content.

Returns
the token markup instance interface
virtual void strus::TextProcessorInterface::defineAggregator ( const std::string &  name,
AggregatorFunctionInterface *  aggregator 
)
pure virtual

Define an aggregator function by name.

Parameters
[in] name: name of the aggregator function to define
[in] aggregator: an aggregator function object (pass ownership)
virtual void strus::TextProcessorInterface::defineDocumentClassDetector ( DocumentClassDetectorInterface *  detector)
pure virtual

Define a content detector.

Parameters
[in] detector: a document class detector object (pass ownership)
virtual void strus::TextProcessorInterface::defineNormalizer ( const std::string &  name,
NormalizerFunctionInterface *  normalizer 
)
pure virtual

Define a normalizer by name.

Parameters
[in] name: name of the normalizer to define
[in] normalizer: a normalizer object (pass ownership)
virtual void strus::TextProcessorInterface::definePatternLexer ( const std::string &  name,
PatternLexerInterface *  lexer 
)
pure virtual

Define a pattern matching lexer by name.

Parameters
[in] name: name of the lexer to define
[in] lexer: a lexer object (pass ownership)
virtual void strus::TextProcessorInterface::definePatternMatcher ( const std::string &  name,
PatternMatcherInterface *  matcher 
)
pure virtual

Define a pattern matcher by name.

Parameters
[in] name: name of the pattern matcher to define
[in] matcher: a pattern matcher object (pass ownership)
virtual void strus::TextProcessorInterface::defineSegmenter ( const std::string &  name,
SegmenterInterface *  segmenter 
)
pure virtual

Define a document segmenter by name.

Parameters
[in] name: name of the document segmenter to define
[in] segmenter: a document segmenter object (pass ownership)
virtual void strus::TextProcessorInterface::defineSegmenterOptions ( const std::string &  scheme,
const analyzer::SegmenterOptions &  options 
)
pure virtual

Define segmenter options by document scheme identifier.

Parameters
[in] scheme: identifier of the document type
[in] options: options attached to this scheme
virtual void strus::TextProcessorInterface::defineTokenizer ( const std::string &  name,
TokenizerFunctionInterface *  tokenizer 
)
pure virtual

Define a tokenizer by name.

Parameters
[in] name: name of the tokenizer to define
[in] tokenizer: a tokenizer object (pass ownership)
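The "(pass ownership)" notes on the define methods mean that the caller hands over a raw pointer and must not delete it afterwards. The sketch below illustrates one way such a contract could be implemented, assuming a hypothetical `MiniTokenizer` interface and `MiniProcessor` class that stand in for the strus types. Returning `nullptr` for an unknown name is an assumption of this sketch, not documented behaviour.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical minimal tokenizer interface (a stand-in, not the strus one).
class MiniTokenizer {
public:
    virtual ~MiniTokenizer() {}
    virtual const char* description() const = 0;
};

class WhitespaceTokenizer : public MiniTokenizer {
public:
    const char* description() const override { return "split on whitespace"; }
};

// Hypothetical processor showing the define/get ownership contract.
class MiniProcessor {
public:
    // Takes ownership of the raw pointer immediately, so the object is
    // released even if inserting into the map throws.
    void defineTokenizer(const std::string& name, MiniTokenizer* tokenizer) {
        std::unique_ptr<MiniTokenizer> owned(tokenizer);
        m_tokenizers[name] = std::move(owned);
    }
    // Returns a borrowed, read-only pointer; the processor keeps ownership.
    // nullptr for unknown names is an assumption of this sketch.
    const MiniTokenizer* getTokenizer(const std::string& name) const {
        auto it = m_tokenizers.find(name);
        return it == m_tokenizers.end() ? nullptr : it->second.get();
    }
private:
    std::map<std::string, std::unique_ptr<MiniTokenizer>> m_tokenizers;
};
```

A caller would write `proc.defineTokenizer("word", new WhitespaceTokenizer());` and later only borrow the object through `getTokenizer("word")`.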
virtual bool strus::TextProcessorInterface::detectDocumentClass ( analyzer::DocumentClass &  dclass,
const char *  contentBegin,
std::size_t  contentBeginSize,
bool  isComplete 
) const
pure virtual

Detect the document class from a document start chunk and set the content description attributes.

Parameters
[in,out] dclass: content document class
[in] contentBegin: start chunk of the document with a reasonable size
[in] contentBeginSize: size of chunk passed
[in] isComplete: true if the chunk passed is the whole document (this might influence the result)
Returns
true if the document format was recognized, false otherwise
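To illustrate the calling convention (in/out class object, start chunk, completeness flag, boolean result), here is a toy detector sketch. The struct `MiniDocumentClass`, the function `detectMiniDocumentClass`, and the first-character heuristics are all hypothetical; real detectors registered through defineDocumentClassDetector are likely to be considerably more thorough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for analyzer::DocumentClass; only a MIME type is kept.
struct MiniDocumentClass {
    std::string mimeType;
};

// Toy detector: looks at the first non-whitespace character of the chunk.
// Returns true and fills dclass if a format was recognized, false otherwise.
inline bool detectMiniDocumentClass(MiniDocumentClass& dclass,
                                    const char* contentBegin,
                                    std::size_t contentBeginSize,
                                    bool isComplete) {
    std::size_t i = 0;
    while (i < contentBeginSize &&
           (contentBegin[i] == ' ' || contentBegin[i] == '\t' ||
            contentBegin[i] == '\r' || contentBegin[i] == '\n')) {
        ++i;  // skip leading whitespace
    }
    if (i == contentBeginSize) {
        // Nothing but whitespace: decidable only if this is the whole document.
        if (!isComplete) return false;
        dclass.mimeType = "text/plain";
        return true;
    }
    if (contentBegin[i] == '<') {
        dclass.mimeType = "application/xml";
    } else if (contentBegin[i] == '{' || contentBegin[i] == '[') {
        dclass.mimeType = "application/json";
    } else {
        dclass.mimeType = "text/plain";
    }
    return true;
}
```

The whitespace-only case shows how isComplete can influence the result: a partial chunk is left undecided, while a complete document is classified.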
static const char* strus::TextProcessorInterface::functionTypeName ( FunctionType  t)
inline static
virtual const AggregatorFunctionInterface* strus::TextProcessorInterface::getAggregator ( const std::string &  name) const
pure virtual

Get a const reference to a statistics collector function object that implements the collection of counts over document parts.

Returns
the statistics collector function reference
virtual std::vector<std::string> strus::TextProcessorInterface::getFunctionList ( const FunctionType &  type) const
pure virtual

Get a list of all functions of a specific type available.

Parameters
[in] type: type of the function
Returns
the list of function names
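A typical use of getFunctionList together with functionTypeName is to print an overview of everything that is registered. The sketch below shows the idea with a hypothetical `MiniRegistry` class; the enumeration mirrors the one documented on this page, but the name strings returned by the stand-in `functionTypeName` are assumptions and may differ from the real ones.

```cpp
#include <map>
#include <string>
#include <vector>

// Mirrors the documented enumeration (stand-in for illustration).
enum FunctionType {
    Segmenter, TokenizerFunction, NormalizerFunction, AggregatorFunction,
    PatternLexer, PatternMatcher
};

// Assumed display names; the real functionTypeName may return other strings.
inline const char* functionTypeName(FunctionType t) {
    static const char* names[] = {
        "segmenter", "tokenizer", "normalizer", "aggregator",
        "pattern lexer", "pattern matcher"
    };
    return names[t];
}

// Hypothetical registry that only remembers function names per type.
class MiniRegistry {
public:
    void define(FunctionType type, const std::string& name) {
        m_names[type].push_back(name);
    }
    // Analogue of getFunctionList: the names registered for one type.
    std::vector<std::string> getFunctionList(const FunctionType& type) const {
        auto it = m_names.find(type);
        return it == m_names.end() ? std::vector<std::string>() : it->second;
    }
private:
    std::map<FunctionType, std::vector<std::string>> m_names;
};

// Builds one line per function type, e.g. "tokenizer: word split".
inline std::string describeAll(const MiniRegistry& reg) {
    std::string out;
    for (int t = Segmenter; t <= PatternMatcher; ++t) {
        FunctionType type = static_cast<FunctionType>(t);
        out += functionTypeName(type);
        out += ":";
        for (const std::string& name : reg.getFunctionList(type)) {
            out += " " + name;
        }
        out += "\n";
    }
    return out;
}
```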
virtual const NormalizerFunctionInterface* strus::TextProcessorInterface::getNormalizer ( const std::string &  name) const
pure virtual

Get a const reference to a normalizer object that implements the transformation of a token into a term string.

Returns
the normalizer reference
virtual const PatternLexerInterface* strus::TextProcessorInterface::getPatternLexer ( const std::string &  name) const
pure virtual

Get a const reference to a pattern lexer.

Returns
the pattern lexer
virtual const PatternMatcherInterface* strus::TextProcessorInterface::getPatternMatcher ( const std::string &  name) const
pure virtual

Get a const reference to a pattern matcher.

Returns
the pattern matcher
virtual const PatternTermFeederInterface* strus::TextProcessorInterface::getPatternTermFeeder ( ) const
pure virtual

Get the default pattern term feeder interface for post processing pattern matching on analyzer output.

Returns
the pattern term feeder
virtual const PosTaggerInterface* strus::TextProcessorInterface::getPosTagger ( ) const
pure virtual

Get the default POS tagger interface to do POS tagging of documents.

Returns
the POS tagger interface (with ownership)
virtual std::string strus::TextProcessorInterface::getResourceFilePath ( const std::string &  filename) const
pure virtual

Get the absolute path of a resource file.

Parameters
[in] filename: name of the resource file
virtual const SegmenterInterface* strus::TextProcessorInterface::getSegmenterByMimeType ( const std::string &  mimetype) const
pure virtual

Get a document segmenter object reference that is able to process the specified MIME type.

Parameters
[in] mimetype: MIME type of the document type to process
Returns
a read only document segmenter reference
virtual const SegmenterInterface* strus::TextProcessorInterface::getSegmenterByName ( const std::string &  segmenterName) const
pure virtual

Get a document segmenter object reference.

Parameters
[in] segmenterName: name of the segmenter used (if empty, find the first one loaded or the default one)
Returns
a read only document segmenter reference
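The two segmenter lookups (by name with an empty-name fallback, and by MIME type) can be sketched as follows. `MiniSegmenter` and `MiniSegmenterRegistry` are hypothetical stand-ins, the segmenter names in the usage note are made up, and this sketch picks "the first one loaded" as the fallback; how the real implementation chooses between the first loaded and a default segmenter is not specified on this page.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal segmenter (a stand-in, not the strus interface).
class MiniSegmenter {
public:
    explicit MiniSegmenter(const std::string& mime) : m_mime(mime) {}
    const std::string& mimeType() const { return m_mime; }
private:
    std::string m_mime;
};

class MiniSegmenterRegistry {
public:
    // Takes ownership of the segmenter; null pointers are ignored here.
    void defineSegmenter(const std::string& name, MiniSegmenter* segmenter) {
        std::unique_ptr<MiniSegmenter> owned(segmenter);
        if (!owned) return;
        m_entries.emplace_back(name, std::move(owned));
    }
    // Empty name: fall back to the first segmenter loaded (assumption).
    const MiniSegmenter* getSegmenterByName(const std::string& name) const {
        if (name.empty()) {
            return m_entries.empty() ? nullptr : m_entries.front().second.get();
        }
        for (const auto& e : m_entries) {
            if (e.first == name) return e.second.get();
        }
        return nullptr;
    }
    // First segmenter that declares the requested MIME type.
    const MiniSegmenter* getSegmenterByMimeType(const std::string& mimetype) const {
        for (const auto& e : m_entries) {
            if (e.second->mimeType() == mimetype) return e.second.get();
        }
        return nullptr;
    }
private:
    std::vector<std::pair<std::string, std::unique_ptr<MiniSegmenter>>> m_entries;
};
```

After registering `"xmlseg"` for `application/xml` and `"jsonseg"` for `application/json`, a lookup with an empty name yields the same object as a lookup for `"xmlseg"`. Both getters return borrowed, read-only pointers.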
virtual analyzer::SegmenterOptions strus::TextProcessorInterface::getSegmenterOptions ( const std::string &  scheme) const
pure virtual

Get the options for a document segmenter for a specific document type.

Parameters
[in] scheme: document scheme identifier that identifies the type of document and its external structure definition
virtual const TokenizerFunctionInterface* strus::TextProcessorInterface::getTokenizer ( const std::string &  name) const
pure virtual

Get a const reference to a tokenizer object that implements the splitting of a text segment into tokens.

Returns
the tokenizer reference
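The division of labour between tokenizers (segment to tokens) and normalizers (token to term string) can be illustrated with a small self-contained sketch. The functions `tokenizeWords`, `normalizeLowercase`, and `termsOf` and the `MiniToken` struct are hypothetical and do not use the strus interfaces; they only show the two-step pipeline with a whitespace tokenizer and an ASCII lowercase normalizer.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token layout: position and length within the segment.
struct MiniToken {
    std::size_t pos;
    std::size_t len;
};

// Tokenizer step: split a text segment into tokens at whitespace.
inline std::vector<MiniToken> tokenizeWords(const std::string& segment) {
    std::vector<MiniToken> tokens;
    std::size_t i = 0;
    while (i < segment.size()) {
        while (i < segment.size() &&
               std::isspace(static_cast<unsigned char>(segment[i]))) ++i;
        std::size_t start = i;
        while (i < segment.size() &&
               !std::isspace(static_cast<unsigned char>(segment[i]))) ++i;
        if (i > start) tokens.push_back(MiniToken{start, i - start});
    }
    return tokens;
}

// Normalizer step: transform one token into a term string (ASCII lowercase).
inline std::string normalizeLowercase(const std::string& token) {
    std::string term;
    for (char ch : token) {
        term += static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
    }
    return term;
}

// Pipeline: segment -> tokens -> terms.
inline std::vector<std::string> termsOf(const std::string& segment) {
    std::vector<std::string> terms;
    for (const MiniToken& t : tokenizeWords(segment)) {
        terms.push_back(normalizeLowercase(segment.substr(t.pos, t.len)));
    }
    return terms;
}
```

Keeping tokens as positions rather than copied strings lets later stages relate terms back to their place in the original segment.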

The documentation for this class was generated from the following file: