strusAnalyzer  0.17
strus Namespace Reference

strus toplevel namespace

Namespaces

 analyzer
 analyzer parameter and return value objects namespace
 

Classes

class  AggregatorFunctionInstanceInterface
 Interface for a parameterized aggregator function.
 
class  AggregatorFunctionInterface
 Interface for the aggregator function constructor.
 
class  AnalyzerObjectBuilderInterface
 Interface providing a mechanism to create complex multi component objects for the document and query analysis in strus.
 
class  ContentIteratorInterface
 Defines an iterator on content provided by a segmenter.
 
class  ContentStatisticsContextInterface
 Defines a program for analyzing a document, splitting it into normalized terms that can be fed to the strus IR engine.
 
class  ContentStatisticsInterface
 Defines a program for analyzing a document, splitting it into normalized terms that can be fed to the strus IR engine.
 
class  DocumentAnalyzerContextInterface
 Defines the context for analyzing multi part documents, iterating on the sub documents defined, splitting them into normalized terms that can be fed to the strus IR engine.
 
class  DocumentAnalyzerInstanceInterface
 Defines a program for analyzing a document, splitting it into normalized terms that can be fed to the strus IR engine.
 
class  DocumentAnalyzerMapInterface
 Defines a program for analyzing a document, splitting it into normalized terms that can be fed to the strus IR engine.
 
class  DocumentClassDetectorInterface
 Defines a detector that returns a content description for a document content it recognizes.
 
class  TagAttributeMarkupInterface
 
class  DocumentTagMarkupDef
 
class  PatternResultFormatContext
 Context for mapping result format strings (allocator, maps, etc.)
 
class  PatternResultFormatVariableMap
 Interface to map variables to a pointer to string.
 
class  PatternResultFormatTable
 Parser for result format strings.
 
struct  PatternResultFormatChunk
 Single chunk of a result format for iterating and building the pattern match result.
 
class  PatternResultFormatMap
 Result format for the output of pattern match results with names of members as variables in curly brackets '{' '}'.
 
class  PatternSerializer
 Object with all interfaces needed for serialization.
 
class  NormalizerFunctionInstanceInterface
 Interface for a parameterized normalization function.
 
class  NormalizerFunctionInterface
 Interface for the normalizer constructor.
 
class  PatternLexerContextInterface
 Interface for detecting lexemes used as basic entities by pattern matching in text.
 
class  PatternLexerInstanceInterface
 Interface for building the automaton for detecting lexemes used as basic entities by pattern matching in text.
 
class  PatternLexerInterface
 Interface for instantiating the data structure of an automaton for detecting lexemes used as basic entities by pattern matching in text.
 
class  PatternMatcherContextInterface
 Interface for detecting patterns (structures formed by atomic tokens) in one document.
 
class  PatternMatcherInstanceInterface
 Interface for building the automaton for detecting patterns in text.
 
class  PatternMatcherInterface
 Interface for creating an automaton for detecting patterns of tokens in a document stream.
 
class  PatternTermFeederInstanceInterface
 Instance interface for defining a mapping of terms of the document analysis output to lexemes used as basic entities by pattern matching.
 
class  PatternTermFeederInterface
 Interface for instantiating the data structure of an automaton for detecting lexemes used as basic entities by pattern matching in text.
 
class  PosTaggerContextInterface
 Context to markup documents with tags derived from POS tagging.
 
class  PosTaggerDataInterface
 Interface for the data built by a POS tagger.
 
class  PosTaggerInstanceInterface
 Interface to define a POS tagger instance for creating the input for POS tagging to build the data, and for creating the context for tagging with the data built from the POS tagging output.
 
class  PosTaggerInterface
 Interface for the construction of a POS tagger instance for a specified segmenter.
 
class  QueryAnalyzerContextInterface
 Defines the context for analyzing queries for the strus IR engine.
 
class  QueryAnalyzerInstanceInterface
 Defines a program for analyzing chunks of a query.
 
class  SegmenterContextInterface
 Defines the context for segmenting one document.
 
class  SegmenterInstanceInterface
 Defines a program for splitting a source text into chunks with an id corresponding to a selecting expression.
 
class  SegmenterInterface
 Defines an interface for creating instances of programs for document segmentation.
 
class  SegmenterMarkupContextInterface
 Defines the context for inserting markups into one document.
 
class  TextProcessorInterface
 Interface for the object providing tokenizers and normalizers used for creating terms from segments of text and functions for collecting overall document statistics.
 
class  TokenizerFunctionInstanceInterface
 Interface for tokenization.
 
class  TokenizerFunctionInterface
 Interface for a tokenizer function.
 
class  TokenMarkupContextInterface
 Interface for annotation of text in one document.
 
class  TokenMarkupInstanceInterface
 Interface for building the automaton for detecting patterns of tokens in a document stream.
 

Typedefs

typedef struct PatternResultFormat PatternResultFormat
 Result format representation (hidden implementation)
 
typedef int SegmenterPosition
 Position of a segment in the original source.
 

Enumerations

enum  PatternSerializerType { PatternMatcherWithLexer, PatternMatcherWithFeeder }
 Defines different types of pattern matchers to serialize.
 

Functions

AggregatorFunctionInterface * createAggregator_typeset (ErrorBufferInterface *errorhnd)
 Get the aggregator function type for the cosine measure normalization factor.
 
AggregatorFunctionInterface * createAggregator_valueset (ErrorBufferInterface *errorhnd)
 
AggregatorFunctionInterface * createAggregator_sumSquareTf (ErrorBufferInterface *errorhnd)
 Get the aggregator function type for the cosine measure normalization factor.
 
DocumentAnalyzerInstanceInterface * createDocumentAnalyzer (const TextProcessorInterface *textproc, const SegmenterInterface *segmenter, const analyzer::SegmenterOptions &opts, ErrorBufferInterface *errorhnd)
 Creates a parameterizable analyzer instance for analyzing documents.
 
QueryAnalyzerInstanceInterface * createQueryAnalyzer (ErrorBufferInterface *errorhnd)
 Creates a parameterizable analyzer instance for analyzing queries.
 
DocumentAnalyzerMapInterface * createDocumentAnalyzerMap (const AnalyzerObjectBuilderInterface *objbuilder, ErrorBufferInterface *errorhnd)
 Creates an analyzer map for bundling different instances of analyzers for different classes of documents.
 
AnalyzerObjectBuilderInterface * createAnalyzerObjectBuilder_default (const FileLocatorInterface *filelocator, ErrorBufferInterface *errorhnd)
 Create an analyzer object builder with the builders from the standard strus core libraries (without module support)
 
analyzer::DocumentClass parse_DocumentClass (const std::string &src, ErrorBufferInterface *errorhnd)
 Parse the document class from source.
 
bool load_DocumentAnalyzer_program_std (DocumentAnalyzerInstanceInterface *analyzer, const TextProcessorInterface *textproc, const std::string &content, ErrorBufferInterface *errorhnd)
 Load a program given as source without includes to a document analyzer.
 
bool load_DocumentAnalyzer_programfile_std (DocumentAnalyzerInstanceInterface *analyzer, const TextProcessorInterface *textproc, const std::string &filename, ErrorBufferInterface *errorhnd)
 Load a program given as source file name to a document analyzer, recursively expanding include directives (C preprocessor style) at the beginning of the source to load.
 
bool load_QueryAnalyzer_program_std (QueryAnalyzerInstanceInterface *analyzer, const TextProcessorInterface *textproc, const std::string &content, ErrorBufferInterface *errorhnd)
 Load a program given as source without includes to a query analyzer.
 
bool load_QueryAnalyzer_programfile_std (QueryAnalyzerInstanceInterface *analyzer, const TextProcessorInterface *textproc, const std::string &filename, ErrorBufferInterface *errorhnd)
 Load a program given as source file name to a query analyzer, recursively expanding include directives (C preprocessor style) at the beginning of the source to load.
 
bool is_DocumentAnalyzer_programfile (const TextProcessorInterface *textproc, const std::string &filename, ErrorBufferInterface *errorhnd)
 Test if a file is an analyzer program file.
 
bool is_DocumentAnalyzer_program (const std::string &source, ErrorBufferInterface *errorhnd)
 Test if a source is an analyzer program.
 
bool load_DocumentAnalyzerMap_program (DocumentAnalyzerMapInterface *analyzermap, const TextProcessorInterface *textproc, const std::string &source, ErrorBufferInterface *errorhnd)
 Load a map of definitions describing how different document types are mapped to an analyzer program from its source.
 
bool load_DocumentAnalyzerMap_programfile (DocumentAnalyzerMapInterface *analyzermap, const TextProcessorInterface *textproc, const std::string &filename, ErrorBufferInterface *errorhnd)
 Load a map of definitions describing how different document types are mapped to an analyzer program from a file.
 
bool load_PatternMatcher_program (const TextProcessorInterface *textproc, PatternTermFeederInstanceInterface *feeder, PatternMatcherInstanceInterface *matcher, const std::string &content, ErrorBufferInterface *errorhnd)
 Load a pattern matcher program with a term feeder from source.
 
bool load_PatternMatcher_programfile (const TextProcessorInterface *textproc, PatternTermFeederInstanceInterface *feeder, PatternMatcherInstanceInterface *matcher, const std::string &filename, ErrorBufferInterface *errorhnd)
 Load a pattern matcher program with a term feeder from a resource file.
 
bool load_PatternMatcher_program (const TextProcessorInterface *textproc, PatternLexerInstanceInterface *lexer, PatternMatcherInstanceInterface *matcher, const std::string &content, ErrorBufferInterface *errorhnd)
 Load a pattern matcher program with a lexer from source.
 
bool load_PatternMatcher_programfile (const TextProcessorInterface *textproc, PatternLexerInstanceInterface *lexer, PatternMatcherInstanceInterface *matcher, const std::string &filename, ErrorBufferInterface *errorhnd)
 Load a pattern matcher program with a lexer from a resource file.
 
ContentStatisticsInterface * createContentStatistics_std (const TextProcessorInterface *textproc, const DocumentClassDetectorInterface *detector, ErrorBufferInterface *errorhnd)
 Get the standard content statistics.
 
DocumentClassDetectorInterface * createDetector_std (const TextProcessorInterface *textproc, ErrorBufferInterface *errorhnd)
 Get the standard content detector (with ownership)
 
std::string markupDocumentTags (const analyzer::DocumentClass &documentClass, const std::string &content, const std::vector< DocumentTagMarkupDef > &markups, const TextProcessorInterface *textproc, ErrorBufferInterface *errorhnd)
 Analyze a content and put markups on every tag matching an expression.
 
TokenMarkupInstanceInterface * createTokenMarkupInstance_standard (ErrorBufferInterface *errorhnd)
 Create the interface for markup of tokens in a document text.
 
NormalizerFunctionInterface * createNormalizer_lowercase (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the lower case of the input as result.
 
NormalizerFunctionInterface * createNormalizer_uppercase (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the upper case of the input as result.
 
NormalizerFunctionInterface * createNormalizer_convdia (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the conversion of diacritical characters to ascii of the input as result.
 
NormalizerFunctionInterface * createNormalizer_charselect (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the selection of characters defined by named sets as result.
 
NormalizerFunctionInterface * createNormalizer_date2int (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the conversion of the input date as number (various units configurable base)
 
NormalizerFunctionInterface * createNormalizer_dictmap (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the mapping of the input with a dictionary as result.
 
NormalizerFunctionInterface * createNormalizer_ngram (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the ngrams of the input as result.
 
NormalizerFunctionInterface * createNormalizer_regex (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the mapping of the input with help of regular expressions as result.
 
NormalizerFunctionInterface * createNormalizer_snowball (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the stemming of the input with the snowball stemmer as result.
 
NormalizerFunctionInterface * createNormalizer_substrindex (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the input trimmed as result.
 
NormalizerFunctionInterface * createNormalizer_substrmap (ErrorBufferInterface *errorhnd)
 
NormalizerFunctionInterface * createNormalizer_trim (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the input trimmed as result.
 
NormalizerFunctionInterface * createNormalizer_wordjoin (ErrorBufferInterface *errorhnd)
 Get the normalizer that returns the words tokenized from the input joined.
 
bool isPatternSerializerContent (const std::string &m_itrcontent, ErrorBufferInterface *errorhnd)
 Evaluate whether a content is a pattern serialization.
 
PatternSerializer * createPatternSerializer (const std::string &filename, const PatternSerializerType &serializerType, ErrorBufferInterface *errorhnd)
 Create a serializer of patterns loaded.
 
PatternSerializer * createPatternSerializerText (std::ostream &output, const PatternSerializerType &serializerType, ErrorBufferInterface *errorhnd)
 Create a serializer of patterns loaded as text to a stream.
 
bool loadPatternMatcherFromSerialization (const std::string &source, PatternLexerInstanceInterface *lexer, PatternMatcherInstanceInterface *matcher, ErrorBufferInterface *errorhnd)
 Instantiate pattern matching interfaces from serialization.
 
bool loadPatternMatcherFromSerialization (const std::string &source, PatternTermFeederInstanceInterface *feeder, PatternMatcherInstanceInterface *matcher, ErrorBufferInterface *errorhnd)
 Instantiate pattern matching interfaces from serialization.
 
PatternTermFeederInterface * createPatternTermFeeder_default (ErrorBufferInterface *errorhnd)
 Create the term feeder interface for pattern matching on analyzer output as input.
 
PatternLexerInterface * createPatternLexer_test (ErrorBufferInterface *errorhnd)
 Create the interface for regular expression matching usable as ground truth for testing.
 
PatternMatcherInterface * createPatternMatcher_test (ErrorBufferInterface *errorhnd)
 Create the interface for pattern matching usable as ground truth for testing.
 
PosTaggerDataInterface * createPosTaggerData_standard (TokenizerFunctionInstanceInterface *tokenizer, ErrorBufferInterface *errorhnd)
 Create an interface for building up the data to tag documents with.
 
PosTaggerInterface * createPosTagger_standard (ErrorBufferInterface *errorhnd)
 Create an interface for the construction of a POS tagger instance for a specified segmenter.
 
SegmenterInterface * createSegmenter_cjson (ErrorBufferInterface *errorhnd)
 Get a document JSON segmenter based on cjson.
 
std::vector< std::string > splitJsonDocumentList (const std::string &encoding, const std::string &content, ErrorBufferInterface *errorhnd)
 
SegmenterInterface * createSegmenter_plain (ErrorBufferInterface *errorhnd)
 Get a document plain text segmenter.
 
SegmenterInterface * createSegmenter_textwolf (ErrorBufferInterface *errorhnd)
 Get a document XML segmenter based on textwolf.
 
SegmenterInterface * createSegmenter_tsv (ErrorBufferInterface *errorhnd)
 Get a document segmenter using tab-separated files as input.
 
TextProcessorInterface * createTextProcessor (const FileLocatorInterface *filelocator, ErrorBufferInterface *errorhnd)
 Create a text processor.
 
TokenizerFunctionInterface * createTokenizer_punctuation (ErrorBufferInterface *errorhnd)
 Get the tokenizer type that creates the tokenization of punctuation elements in the input.
 
TokenizerFunctionInterface * createTokenizer_regex (ErrorBufferInterface *errorhnd)
 Get the tokenizer type that creates the tokenization with help of regular expressions.
 
TokenizerFunctionInterface * createTokenizer_textcat (const TextProcessorInterface *textproc, ErrorBufferInterface *errorhnd)
 Get the tokenizer type that creates the tokenization of words in a recognized language.
 
TokenizerFunctionInterface * createTokenizer_word (ErrorBufferInterface *errorhnd)
 Get the tokenizer type that creates the tokenization of words in the input.
 
TokenizerFunctionInterface * createTokenizer_whitespace (ErrorBufferInterface *errorhnd)
 Get the tokenizer type that creates the tokenization as splitting of the input by whitespaces.
 
TokenizerFunctionInterface * createTokenizer_langtoken (ErrorBufferInterface *errorhnd)
 Get the tokenizer type that creates the tokenization as splitting of all tokens, returning sequences of language characters as tokens and word boundary delimiters as single character.
 

Detailed Description

strus toplevel namespace

Exported functions for the program loader of the analyzer (load program in a domain specific language)

Typedef Documentation

typedef struct PatternResultFormat strus::PatternResultFormat

Result format representation (hidden implementation)

typedef int strus::SegmenterPosition

Position of a segment in the original source.

Enumeration Type Documentation

enum strus::PatternSerializerType

Defines different types of pattern matchers to serialize.

Enumerator
PatternMatcherWithLexer 
PatternMatcherWithFeeder 
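
The two enumerators appear to correspond to the two overloads of load_PatternMatcher_program and loadPatternMatcherFromSerialization listed on this page: one pairs the matcher with a lexer, the other with a term feeder. The following is a minimal sketch of that reading. The enumeration is redeclared only to keep the example self-contained, and `companionInterfaceName` is a hypothetical helper, not a strus function.

```cpp
#include <string>

// Redeclared for the sketch only; enumerator names are taken from this page.
enum PatternSerializerType { PatternMatcherWithLexer, PatternMatcherWithFeeder };

// Hypothetical helper: names the instance interface that is assumed to
// accompany the pattern matcher for each serializer type.
std::string companionInterfaceName(PatternSerializerType type)
{
    switch (type)
    {
        case PatternMatcherWithLexer:
            return "PatternLexerInstanceInterface";
        case PatternMatcherWithFeeder:
            return "PatternTermFeederInstanceInterface";
    }
    return std::string();  // not reached for valid enumerators
}
```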

Function Documentation

AggregatorFunctionInterface* strus::createAggregator_sumSquareTf ( ErrorBufferInterface *  errorhnd)

Get the aggregator function type for the cosine measure normalization factor.

Returns
the aggregator function
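
To illustrate what a cosine measure normalization factor is, here is a minimal sketch in plain C++. It assumes, from the function name only, that the aggregate is built from the squared term frequencies of one document. Whether the strus aggregator yields the sum itself or its square root is not stated on this page, so both are shown. The helper names and the map-based input are hypothetical.

```cpp
#include <cmath>
#include <map>
#include <string>

// Hypothetical helper, not part of strus: sums the squared term
// frequencies of one document.
double sumSquareTf(const std::map<std::string, unsigned>& termFrequencies)
{
    double sum = 0.0;
    for (const auto& entry : termFrequencies)
    {
        const double tf = static_cast<double>(entry.second);
        sum += tf * tf;
    }
    return sum;
}

// Euclidean length of the term frequency vector, the usual divisor
// when computing a cosine measure.
double cosineNormalizationFactor(const std::map<std::string, unsigned>& termFrequencies)
{
    return std::sqrt(sumSquareTf(termFrequencies));
}
```

For a document with two terms occurring 3 and 4 times, the sum of squares is 25 and the vector length is 5.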
AggregatorFunctionInterface* strus::createAggregator_typeset ( ErrorBufferInterface *  errorhnd)

Get the aggregator function type for the cosine measure normalization factor.

Returns
the aggregator function
AggregatorFunctionInterface* strus::createAggregator_valueset ( ErrorBufferInterface *  errorhnd)
AnalyzerObjectBuilderInterface* strus::createAnalyzerObjectBuilder_default ( const FileLocatorInterface *  filelocator,
ErrorBufferInterface *  errorhnd 
)

Create an analyzer object builder with the builders from the standard strus core libraries (without module support)

Parameters
[in] filelocator  resources and file locator interface
[in] errorhnd  error buffer interface
ContentStatisticsInterface* strus::createContentStatistics_std ( const TextProcessorInterface *  textproc,
const DocumentClassDetectorInterface *  detector,
ErrorBufferInterface *  errorhnd 
)

Get the standard content statistics.

Returns
the standard content statistics interface (with ownership)
DocumentClassDetectorInterface* strus::createDetector_std ( const TextProcessorInterface *  textproc,
ErrorBufferInterface *  errorhnd 
)

Get the standard content detector (with ownership)

Returns
the content detector class
DocumentAnalyzerInstanceInterface* strus::createDocumentAnalyzer ( const TextProcessorInterface *  textproc,
const SegmenterInterface *  segmenter,
const analyzer::SegmenterOptions &  opts,
ErrorBufferInterface *  errorhnd 
)

Creates a parameterizable analyzer instance for analyzing documents.

Parameters
[in] textproc  text processor for creating functions and resources needed for analysis
[in] segmenter  segmenter type to be used by the created analyzer
[in] opts  options for the segmenter
[in] errorhnd  error buffer interface
Returns
the analyzer program (with ownership)
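
Several factory functions on this page return a raw pointer "with ownership", meaning the caller is responsible for deleting the object. One common way to handle that in C++ is to wrap the pointer in a std::unique_ptr immediately. The sketch below uses stand-in types (`AnalyzerStub`, `createAnalyzerStub`) rather than the strus headers, and it assumes that a failed creation yields a null pointer, which this page does not state.

```cpp
#include <memory>

// Stand-in for an interface returned with ownership.
// Hypothetical type, not a strus declaration.
class AnalyzerStub
{
public:
    virtual ~AnalyzerStub() {}
    virtual int id() const = 0;
};

// Stand-in factory: returns a raw pointer the caller owns.
// Assumption: a null pointer signals failure.
AnalyzerStub* createAnalyzerStub(bool fail)
{
    class Impl : public AnalyzerStub
    {
    public:
        int id() const override { return 1; }
    };
    return fail ? nullptr : new Impl();
}

// Wraps the owned raw pointer so that it is deleted exactly once,
// also on early returns and exceptions in the calling code.
std::unique_ptr<AnalyzerStub> makeAnalyzer(bool fail)
{
    return std::unique_ptr<AnalyzerStub>(createAnalyzerStub(fail));
}
```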
DocumentAnalyzerMapInterface* strus::createDocumentAnalyzerMap ( const AnalyzerObjectBuilderInterface *  objbuilder,
ErrorBufferInterface *  errorhnd 
)

Creates an analyzer map for bundling different instances of analyzers for different classes of documents.

Parameters
[in] objbuilder  analyzer object builder interface
[in] errorhnd  error buffer interface
Returns
the analyzer map (with ownership)
NormalizerFunctionInterface* strus::createNormalizer_charselect ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the selection of characters defined by named sets as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_convdia ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the conversion of diacritical characters to ascii of the input as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_date2int ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the conversion of the input date as number (various units configurable base)

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_dictmap ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the mapping of the input with a dictionary as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_lowercase ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the lower case of the input as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_ngram ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the ngrams of the input as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_regex ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the mapping of the input with help of regular expressions as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_snowball ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the stemming of the input with the snowball stemmer as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_substrindex ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the input trimmed as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_substrmap ( ErrorBufferInterface *  errorhnd)
NormalizerFunctionInterface* strus::createNormalizer_trim ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the input trimmed as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_uppercase ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the upper case of the input as result.

Returns
the normalization function
NormalizerFunctionInterface* strus::createNormalizer_wordjoin ( ErrorBufferInterface *  errorhnd)

Get the normalizer that returns the words tokenized from the input joined.

Returns
the normalization function
PatternLexerInterface* strus::createPatternLexer_test ( ErrorBufferInterface *  errorhnd)

Create the interface for regular expression matching usable as ground truth for testing.

PatternMatcherInterface* strus::createPatternMatcher_test ( ErrorBufferInterface *  errorhnd)

Create the interface for pattern matching usable as ground truth for testing.

PatternSerializer* strus::createPatternSerializer ( const std::string &  filename,
const PatternSerializerType &  serializerType,
ErrorBufferInterface *  errorhnd 
)

Create a serializer of patterns loaded.

Parameters
[in] filename  path to file where to write the output to
[in] serializerType  type of serialization
[in] errorhnd  error buffer interface
PatternSerializer* strus::createPatternSerializerText ( std::ostream &  output,
const PatternSerializerType &  serializerType,
ErrorBufferInterface *  errorhnd 
)

Create a serializer of patterns loaded as text to a stream.

Parameters
[in] output  where to print text output to
[in] serializerType  type of serialization
[in] errorhnd  error buffer interface
PatternTermFeederInterface* strus::createPatternTermFeeder_default ( ErrorBufferInterface *  errorhnd)

Create the term feeder interface for pattern matching on analyzer output as input.

Parameters
[in] errorhnd  error buffer interface
PosTaggerInterface* strus::createPosTagger_standard ( ErrorBufferInterface *  errorhnd)

Create an interface for the construction of a POS tagger instance for a specified segmenter.

Parameters
[in] errorhnd  error buffer interface for exceptions thrown
Returns
the POS tagger base interface
PosTaggerDataInterface* strus::createPosTaggerData_standard ( TokenizerFunctionInstanceInterface *  tokenizer,
ErrorBufferInterface *  errorhnd 
)

Create an interface for building up the data to tag documents with.

Parameters
[in] tokenizer  tokenizer interface to use (passed with ownership)
[in] errorhnd  error buffer interface for exceptions thrown
Returns
the structure to collect POS tagging output
QueryAnalyzerInstanceInterface* strus::createQueryAnalyzer ( ErrorBufferInterface *  errorhnd)

Creates a parameterizable analyzer instance for analyzing queries.

Parameters
[in] errorhnd  error buffer interface
Returns
the analyzer program (with ownership)
SegmenterInterface* strus::createSegmenter_cjson ( ErrorBufferInterface *  errorhnd)

Get a document JSON segmenter based on cjson.

Returns
the segmenter
SegmenterInterface* strus::createSegmenter_plain ( ErrorBufferInterface *  errorhnd)

Get a document plain text segmenter.

Returns
the segmenter
SegmenterInterface* strus::createSegmenter_textwolf ( ErrorBufferInterface *  errorhnd)

Get a document XML segmenter based on textwolf.

Returns
the segmenter
SegmenterInterface* strus::createSegmenter_tsv ( ErrorBufferInterface *  errorhnd)

Get a document segmenter using tab-separated files as input.

Returns
the segmenter
TextProcessorInterface* strus::createTextProcessor ( const FileLocatorInterface *  filelocator,
ErrorBufferInterface *  errorhnd 
)

Create a text processor.

Returns
the constructed text processor
TokenizerFunctionInterface* strus::createTokenizer_langtoken ( ErrorBufferInterface *  errorhnd)

Get the tokenizer type that creates the tokenization as splitting of all tokens, returning sequences of language characters as tokens and word boundary delimiters as single character.

Returns
the tokenization function
TokenizerFunctionInterface* strus::createTokenizer_punctuation ( ErrorBufferInterface *  errorhnd)

Get the tokenizer type that creates the tokenization of punctuation elements in the input.

Returns
the tokenization function
TokenizerFunctionInterface* strus::createTokenizer_regex ( ErrorBufferInterface *  errorhnd)

Get the tokenizer type that creates the tokenization with help of regular expressions.

Returns
the tokenization function
TokenizerFunctionInterface* strus::createTokenizer_textcat ( const TextProcessorInterface *  textproc,
ErrorBufferInterface *  errorhnd 
)

Get the tokenizer type that creates the tokenization of words in a recognized language.

Returns
the tokenization function
TokenizerFunctionInterface* strus::createTokenizer_whitespace ( ErrorBufferInterface *  errorhnd)

Get the tokenizer type that creates the tokenization as splitting of the input by whitespaces.

Returns
the tokenization function
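
As an illustration of tokenization by splitting on whitespace, here is a minimal stand-alone sketch. It is not the strus implementation. The `TokenSpan` record with a byte offset and a length is an assumption made for the example, as is the use of std::isspace to decide what counts as whitespace.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token record: byte offset and length within the input.
struct TokenSpan
{
    std::size_t pos;
    std::size_t size;
};

// Splits the input at runs of whitespace and returns the position and
// size of every non-empty chunk in between.
std::vector<TokenSpan> tokenizeWhitespace(const std::string& input)
{
    std::vector<TokenSpan> tokens;
    std::size_t i = 0;
    while (i < input.size())
    {
        // Skip a run of whitespace characters.
        while (i < input.size() && std::isspace(static_cast<unsigned char>(input[i]))) ++i;
        const std::size_t start = i;
        // Consume a run of non-whitespace characters.
        while (i < input.size() && !std::isspace(static_cast<unsigned char>(input[i]))) ++i;
        if (i > start) tokens.push_back(TokenSpan{start, i - start});
    }
    return tokens;
}
```

Returning spans instead of copied strings keeps the link to the original source, so that a later normalization step can still report where a term came from.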
TokenizerFunctionInterface* strus::createTokenizer_word ( ErrorBufferInterface *  errorhnd)

Get the tokenizer type that creates the tokenization of words in the input.

Returns
the tokenization function
TokenMarkupInstanceInterface* strus::createTokenMarkupInstance_standard ( ErrorBufferInterface *  errorhnd)

Create the interface for markup of tokens in a document text.

bool strus::is_DocumentAnalyzer_program ( const std::string &  source,
ErrorBufferInterface *  errorhnd 
)

Test if a source is an analyzer program.

Parameters
[in] source  source to test
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure (inspect errorhnd for errors)
bool strus::is_DocumentAnalyzer_programfile ( const TextProcessorInterface *  textproc,
const std::string &  filename,
ErrorBufferInterface *  errorhnd 
)

Test if a file is an analyzer program file.

Parameters
[in] textproc  text processor interface to determine the path of the filename
[in] filename  name of the file to load
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure (inspect errorhnd for errors)
bool strus::isPatternSerializerContent ( const std::string &  m_itrcontent,
ErrorBufferInterface *  errorhnd 
)

Evaluate whether a content is a pattern serialization.

Parameters
[in] m_itrcontent  content to check
bool strus::load_DocumentAnalyzer_program_std ( DocumentAnalyzerInstanceInterface *  analyzer,
const TextProcessorInterface *  textproc,
const std::string &  content,
ErrorBufferInterface *  errorhnd 
)

Load a program given as source without includes to a document analyzer.

Parameters
[in,out] analyzer  analyzer object to instrument
[in] textproc  text processor interface
[in] content  source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure (inspect errorhnd for errors)
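
The loader functions share one convention: they return a bool and leave the details of a failure in the error buffer passed as errorhnd. The sketch below shows the calling pattern with stand-in types. `ErrorBufferStub` and `loadProgramStub` are hypothetical and do not reproduce the ErrorBufferInterface API, and the "empty source" failure is invented purely to have something to report.

```cpp
#include <string>

// Stand-in error buffer. Hypothetical, not the strus ErrorBufferInterface.
struct ErrorBufferStub
{
    bool failed;
    std::string message;

    ErrorBufferStub() : failed(false) {}
    void report(const std::string& msg) { failed = true; message = msg; }
};

// Stand-in loader following the documented convention: it reports the
// error to the buffer and returns false instead of throwing.
bool loadProgramStub(const std::string& content, ErrorBufferStub* errorhnd)
{
    if (content.empty())
    {
        errorhnd->report("empty program source");
        return false;
    }
    return true;
}

// Caller side: test the return value first, then inspect the buffer.
std::string loadOrDescribeError(const std::string& content)
{
    ErrorBufferStub errorhnd;
    if (!loadProgramStub(content, &errorhnd))
    {
        return "failed: " + errorhnd.message;
    }
    return "ok";
}
```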
bool strus::load_DocumentAnalyzer_programfile_std ( DocumentAnalyzerInstanceInterface *  analyzer,
const TextProcessorInterface *  textproc,
const std::string &  filename,
ErrorBufferInterface *  errorhnd 
)

Load a program given as source file name to a document analyzer, recursively expanding include directives (C preprocessor style) at the beginning of the source to load.

Parameters
[in,out] analyzer  analyzer object to instrument
[in] textproc  text processor interface to determine the path of the filename
[in] filename  name of the file to load
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure (inspect errorhnd for errors)
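
The recursive expansion of include directives at the beginning of a source can be sketched as follows. This is not the strus loader. The directive syntax `#include "name"`, the in-memory `FileMap` standing in for the file system, and the circular include check are all assumptions made for the example.

```cpp
#include <map>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// In-memory stand-in for the file system: file name -> content.
typedef std::map<std::string, std::string> FileMap;

// Expands leading include lines recursively and returns the flattened
// source. The set `active` holds the files currently being expanded.
std::string expandIncludes(const FileMap& files, const std::string& filename,
                           std::set<std::string>& active)
{
    FileMap::const_iterator fi = files.find(filename);
    if (fi == files.end()) throw std::runtime_error("file not found: " + filename);
    if (!active.insert(filename).second) throw std::runtime_error("circular include: " + filename);

    const std::string prefix = "#include \"";
    std::istringstream input(fi->second);
    std::string line, result;
    bool inHeader = true;  // includes are honored only at the beginning
    while (std::getline(input, line))
    {
        if (inHeader && line.size() > prefix.size()
            && line.compare(0, prefix.size(), prefix) == 0
            && line[line.size() - 1] == '"')
        {
            // Extract the name between the quotes and expand it in place.
            const std::string included =
                line.substr(prefix.size(), line.size() - prefix.size() - 1);
            result += expandIncludes(files, included, active);
        }
        else
        {
            inHeader = false;
            result += line + "\n";
        }
    }
    active.erase(filename);
    return result;
}
```

Tracking the files currently being expanded turns an include cycle into an error instead of an endless recursion.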
bool strus::load_DocumentAnalyzerMap_program ( DocumentAnalyzerMapInterface *  analyzermap,
const TextProcessorInterface *  textproc,
const std::string &  source,
ErrorBufferInterface *  errorhnd 
)

Load a map of definitions describing how different document types are mapped to an analyzer program from its source.

Parameters
[in,out] analyzermap  map of analyzers to instrument
[in] textproc  text processor interface to determine the path of filenames
[in] source  source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure
bool strus::load_DocumentAnalyzerMap_programfile ( DocumentAnalyzerMapInterface *  analyzermap,
const TextProcessorInterface *  textproc,
const std::string &  filename,
ErrorBufferInterface *  errorhnd 
)

Load a map of definitions describing how different document types are mapped to an analyzer program from a file.

Parameters
[in,out] analyzermap  map of analyzers to instrument
[in] textproc  text processor interface to determine the path of filenames
[in] filename  file name of source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure
bool strus::load_PatternMatcher_program ( const TextProcessorInterface *  textproc,
PatternTermFeederInstanceInterface *  feeder,
PatternMatcherInstanceInterface *  matcher,
const std::string &  content,
ErrorBufferInterface *  errorhnd 
)

Load a pattern matcher program with a term feeder from source.

Parameters
[in] textproc  text processor interface to determine the path of filenames
[in,out] feeder  term feeder to instrument
[in,out] matcher  pattern matcher to instrument
[in] content  source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure
bool strus::load_PatternMatcher_program ( const TextProcessorInterface *  textproc,
PatternLexerInstanceInterface *  lexer,
PatternMatcherInstanceInterface *  matcher,
const std::string &  content,
ErrorBufferInterface *  errorhnd 
)

Load a pattern matcher program with a lexer from source.

Parameters
[in] textproc  text processor interface to determine the path of filenames
[in,out] lexer  tokenization for the pattern matcher
[in,out] matcher  pattern matcher to instrument
[in] content  source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure
bool strus::load_PatternMatcher_programfile ( const TextProcessorInterface *  textproc,
PatternTermFeederInstanceInterface *  feeder,
PatternMatcherInstanceInterface *  matcher,
const std::string &  filename,
ErrorBufferInterface *  errorhnd 
)

Load a pattern matcher program with a term feeder from a resource file.

Parameters
[in] textproc  text processor interface to determine the path of filenames
[in,out] feeder  term feeder to instrument
[in,out] matcher  pattern matcher to instrument
[in] filename  file name of source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure
bool strus::load_PatternMatcher_programfile ( const TextProcessorInterface *  textproc,
PatternLexerInstanceInterface *  lexer,
PatternMatcherInstanceInterface *  matcher,
const std::string &  filename,
ErrorBufferInterface *  errorhnd 
)

Load a pattern matcher program with a lexer from a resource file.

Parameters
[in] textproc  text processor interface to determine the path of filenames
[in,out] lexer  tokenization for the pattern matcher
[in,out] matcher  pattern matcher to instrument
[in] filename  file name of source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure
bool strus::load_QueryAnalyzer_program_std ( QueryAnalyzerInstanceInterface *  analyzer,
const TextProcessorInterface *  textproc,
const std::string &  content,
ErrorBufferInterface *  errorhnd 
)

Load a program given as source without includes to a query analyzer.

Parameters
[in,out] analyzer  analyzer object to instrument
[in] textproc  text processor interface to determine the path of filenames
[in] content  source with definitions
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure (inspect errorhnd for errors)
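
The loader functions on this page share one contract: they return a bool and report failures (including caught exceptions) through the error buffer argument. The following sketch illustrates that contract in isolation. `ErrorBufferSketch` and `loadProgramSketch` are hypothetical stand-ins written for this example; they are not the strus `ErrorBufferInterface` or any strus loader, and the "empty source" failure is an invented condition.

```cpp
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for an error buffer; NOT strus::ErrorBufferInterface.
class ErrorBufferSketch
{
public:
    // Keep the first reported error; later reports do not overwrite it.
    void report(const std::string& msg) { if (m_msg.empty()) m_msg = msg; }
    bool hasError() const { return !m_msg.empty(); }
    // Return the stored message and clear the buffer.
    std::string fetchError() { std::string rt; rt.swap(m_msg); return rt; }
private:
    std::string m_msg;
};

// Hypothetical loader that follows the documented contract: no exception
// leaves the function; a failure yields 'false' plus a message in the buffer.
bool loadProgramSketch(const std::string& content, ErrorBufferSketch* errorhnd)
{
    try {
        // Invented failure condition, used only to trigger the error path.
        if (content.empty()) throw std::runtime_error("empty program source");
        // ... a real loader would parse 'content' here ...
        return true;
    } catch (const std::exception& err) {
        errorhnd->report(err.what());
        return false;
    }
}
```

A caller would test the return value first and fetch the message from the buffer only when the call returned false.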
bool strus::load_QueryAnalyzer_programfile_std ( QueryAnalyzerInstanceInterface *  analyzer,
const TextProcessorInterface *  textproc,
const std::string &  filename,
ErrorBufferInterface *  errorhnd 
)

Load a program given as source file name to a query analyzer, recursively expanding include directives (C preprocessor style) at the beginning of the source to load.

Parameters
[in,out] analyzer  analyzer object to instrument
[in] textproc  text processor interface to determine the path of filenames
[in] filename  name of the file to load
[in,out] errorhnd  buffer for reporting errors (exceptions)
Returns
true on success, false on failure (inspect errorhnd for errors)
bool strus::loadPatternMatcherFromSerialization ( const std::string &  source,
PatternLexerInstanceInterface *  lexer,
PatternMatcherInstanceInterface *  matcher,
ErrorBufferInterface *  errorhnd 
)

Instantiate pattern matching interfaces from serialization.

Parameters
[in] source  source content (not a filename!) to read the input from
[in] lexer  pattern lexer instance interface to instantiate from deserialization
[in] matcher  pattern matcher instance interface to instantiate from deserialization
[in] errorhnd  error buffer interface
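
Because the `source` argument is the serialized content itself and not a file name, a caller that keeps the serialization in a file has to read it into a string first. A minimal sketch of such a helper follows; `readFileContent` is a hypothetical name introduced for this example and is not part of strus.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read a whole file into a string, byte for byte.
// Binary mode avoids newline translation, which matters for serialized data.
std::string readFileContent(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open file: " + path);
    std::ostringstream buf;
    buf << in.rdbuf(); // copy the complete stream buffer
    return buf.str();
}
```

The returned string would then be passed as `source`, for example `loadPatternMatcherFromSerialization( readFileContent( path), lexer, matcher, errorhnd)`, assuming the lexer, matcher and error buffer objects have been created elsewhere.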
bool strus::loadPatternMatcherFromSerialization ( const std::string &  source,
PatternTermFeederInstanceInterface *  feeder,
PatternMatcherInstanceInterface *  matcher,
ErrorBufferInterface *  errorhnd 
)

Instantiate pattern matching interfaces from serialization.

Parameters
[in] source  source content (not a filename!) to read the input from
[in] feeder  pattern term feeder instance interface to instantiate from deserialization
[in] matcher  pattern matcher instance interface to instantiate from deserialization
[in] errorhnd  error buffer interface
std::string strus::markupDocumentTags ( const analyzer::DocumentClass &  documentClass,
const std::string &  content,
const std::vector< DocumentTagMarkupDef > &  markups,
const TextProcessorInterface *  textproc,
ErrorBufferInterface *  errorhnd 
)

Analyze content and put markup on every tag matching an expression.

Remarks
This function is currently implemented for XML only.
Parameters
[in] documentClass  document class of the content with the encoding specified
[in] content  the content to process
[in] markups  array of definitions for markup
[in] textproc  text processor interface
[in] errorhnd  error buffer for reporting errors/exceptions
Returns
the tagged document
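
To give a rough idea of what "putting markup on a matching tag" means, the toy sketch below adds an attribute to every opening tag with a given name. It is only an illustration under strong assumptions: it works on the raw string, matches a plain tag name instead of a selection expression, and ignores comments, CDATA sections and encodings. The function name `markTags` and its parameters are hypothetical; the layout of `DocumentTagMarkupDef` is not described on this page and is not modelled here.

```cpp
#include <cstddef>
#include <string>

// Toy illustration: insert 'attr' into every opening tag named 'tag'.
// This is NOT how strus processes documents; it is a plain string scan.
std::string markTags(const std::string& xml, const std::string& tag,
                     const std::string& attr)
{
    const std::string open = "<" + tag;
    std::string out;
    std::size_t pos = 0;
    for (;;) {
        const std::size_t hit = xml.find(open, pos);
        if (hit == std::string::npos) break;
        const std::size_t end = hit + open.size();
        out.append(xml, pos, end - pos); // copy up to and including "<tag"
        // Accept only a complete tag name: the next character must end it.
        if (end < xml.size()
            && (xml[end] == '>' || xml[end] == ' ' || xml[end] == '/')) {
            out += ' ';
            out += attr;
        }
        pos = end;
    }
    out.append(xml, pos, std::string::npos); // copy the remainder
    return out;
}
```

Checking the character after the tag name keeps a request for `p` from also marking `pre`; closing tags are untouched because they start with `</`.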
analyzer::DocumentClass strus::parse_DocumentClass ( const std::string &  src,
ErrorBufferInterface *  errorhnd 
)

Parse the document class from source.

Parameters
[in] src  document class definition as string
[in,out] errorhnd  interface for reporting errors and exceptions that occurred
Returns
the document class structure
std::vector<std::string> strus::splitJsonDocumentList ( const std::string &  encoding,
const std::string &  content,
ErrorBufferInterface *  errorhnd 
)