The strus project provides a set of libraries for building a search engine for information retrieval. The engine can evaluate structured queries on unstructured text as well as implement classical information retrieval. It is independent of the key/value store database implementation. The current database implementation is based on LevelDB. The project is hosted on GitHub.
strus defines the evaluation of a query in terms of three types of operations:
- Fetching and joining of feature occurrences: Feature occurrences, also called postings, are represented as sets of pairs {(d,p) | d is a document number, p is a position}, where d and p are positive integers. d is the unique id of the document in the storage, and p is the term position in the document. These sets are built from the basic occurrences of terms stored in the storage (see here). Combined with the set join operators provided by the query processor interface, they let you build representations of more complex structures. The basic set join operators are the following:
- Basic operators of the boolean algebra of sets of (d, p) pairs: intersection, union and relative complement.
- Unary set construction operators like the successor set A+ of A, defined as {(d,p) | (d,p-1) element of A}, and the predecessor set, defined as {(d,p) | (d,p+1) element of A and p ≠ 0}.
- N-ary set selection operators that select postings with the help of context information. For example, within_struct gets the first element of an interval that is no bigger than the defined maximum range, contains at least one element of each input set, and does not overlap a specified delimiter token.
- Weighting of documents based on the feature occurrences: Weighting defines how documents are ranked in a search result. It is defined by weighting functions that take iterators on the feature occurrences and some numeric parameters as input and calculate the weight of a document. You can define scalar functions to combine several weighting functions into one.
- Summarization (extraction of content): Summarization is used to extract content from matching documents. With summarization you can do various things:
- Extract the passages of a document that best match the query and present them to the user as the summary of a ranked result.
- Extract features close to matching passages for feature selection, categorization, clustering, query answering, etc.
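The posting model and the first two operation types can be illustrated with a small sketch. It is written in Python for brevity and is not the strus API: the names `successor`, `predecessor` and `match_count_weight` are hypothetical, the sample postings are made up, and the weight shown is a toy match count rather than one of the weighting functions strus provides.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A posting is a (d, p) pair: document number d and term position p.
Posting = Tuple[int, int]
PostingSet = Set[Posting]


def successor(a: PostingSet) -> PostingSet:
    """Successor set A+ = {(d, p) | (d, p-1) element of A}."""
    return {(d, p + 1) for (d, p) in a}


def predecessor(a: PostingSet) -> PostingSet:
    """Predecessor set = {(d, p) | (d, p+1) element of A and p != 0}."""
    return {(d, p - 1) for (d, p) in a if p - 1 != 0}


def match_count_weight(matches: PostingSet) -> Dict[int, int]:
    """Toy weighting (an assumption for illustration): matches per document."""
    return dict(Counter(d for (d, _) in matches))


# Made-up postings of two terms.
information: PostingSet = {(1, 3), (1, 10), (2, 5)}
retrieval: PostingSet = {(1, 4), (2, 9), (3, 1)}

# Boolean algebra of sets maps directly onto Python set operators.
either = information | retrieval          # union
only_information = information - retrieval  # relative complement

# "information" immediately followed by "retrieval": shift the first set by
# one position and intersect it with the second.
phrase = successor(information) & retrieval

# Rank documents by descending toy weight.
ranking = sorted(match_count_weight(phrase).items(), key=lambda item: -item[1])
```

Here `phrase` contains only the posting (1, 4), because document 1 is the only one where "retrieval" directly follows "information". More complex structures are built the same way, by nesting join operators over the basic term postings.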
Architecture
The strus core defines components that are implemented as libraries.
- queryeval Query evaluation: Interprets the query and uses the operators defined in the query processor to execute it.
- queryproc Query processor: A map that gives access to functions by name, such as the set operations on feature occurrences, the weighting functions, and the summarizers that augment the results.
- storage Storage: Defines the storage where all retrievable information is stored. Implements access to the statistics and to the occurrences of the basic terms.
- database Key/value store database: Implements the storing and retrieval of the storage data blocks. Currently there is only one implementation, based on LevelDB.
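How these four components layer on top of each other can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the strus interfaces: every class, method and function name here (`Database`, `InMemoryDatabase`, `Storage`, `QueryProcessor`, `evaluate`) is hypothetical, blocks are encoded as JSON purely for readability, and an in-memory dictionary stands in for the key/value store.

```python
import json
from typing import Callable, Dict, Optional, Protocol, Set, Tuple

Posting = Tuple[int, int]
PostingSet = Set[Posting]
JoinOperator = Callable[[PostingSet, PostingSet], PostingSet]


class Database(Protocol):
    """Hypothetical key/value interface; any store with get/put fits."""
    def get(self, key: bytes) -> Optional[bytes]: ...
    def put(self, key: bytes, value: bytes) -> None: ...


class InMemoryDatabase:
    """Stand-in for a real key/value store, backed by a dict."""
    def __init__(self) -> None:
        self._blocks: Dict[bytes, bytes] = {}

    def get(self, key: bytes) -> Optional[bytes]:
        return self._blocks.get(key)

    def put(self, key: bytes, value: bytes) -> None:
        self._blocks[key] = value


class Storage:
    """Keeps term postings as data blocks in whatever database it is given."""
    def __init__(self, database: Database) -> None:
        self._db = database

    def store_postings(self, term: str, postings: PostingSet) -> None:
        self._db.put(term.encode(), json.dumps(sorted(postings)).encode())

    def postings(self, term: str) -> PostingSet:
        block = self._db.get(term.encode())
        return {tuple(pair) for pair in json.loads(block)} if block else set()


class QueryProcessor:
    """Map from names to functions (only binary join operators here)."""
    def __init__(self) -> None:
        self._join_operators: Dict[str, JoinOperator] = {}

    def define_join_operator(self, name: str, operator: JoinOperator) -> None:
        self._join_operators[name] = operator

    def join_operator(self, name: str) -> JoinOperator:
        return self._join_operators[name]


def evaluate(storage: Storage, processor: QueryProcessor,
             operator_name: str, term_a: str, term_b: str) -> PostingSet:
    """Query evaluation: look up the operator by name, apply it to postings."""
    operator = processor.join_operator(operator_name)
    return operator(storage.postings(term_a), storage.postings(term_b))


storage = Storage(InMemoryDatabase())
storage.store_postings("search", {(1, 2), (2, 7)})
storage.store_postings("engine", {(1, 3), (3, 1)})

processor = QueryProcessor()
processor.define_join_operator("union", set.union)

result = evaluate(storage, processor, "union", "search", "engine")
```

Because `Storage` only depends on the `get`/`put` interface, the dictionary-backed store could be swapped for another key/value store without touching the layers above it, which is the kind of independence from the database implementation described above.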