pybool_ir.index.index

Base classes for indexing and searching documents.

Classes

Indexer(index_path[, store_fields, ...])

Base class that provides the basic functionality for indexing and searching documents.

SearcherMixin()

Include this mixin to add search functionality to an Indexer.

class pybool_ir.index.index.Indexer(index_path: Path | str, store_fields: bool = True, optional_fields: List[str] | None = None)

Bases: ABC

Base class that provides the basic functionality for indexing documents. On its own, this class offers no way to search the index other than through the Lucene API directly; include SearcherMixin to add search functionality.

add_document(doc: Document, optional_fields: Dict[str, Callable[[Document], Any]] | None = None) -> None

Add a single document to the index. This method is called by bulk_index.

optional_fields is a dictionary that maps field names to functions; each function takes a document and returns the value for that field. This is useful for adding fields that are not part of the document itself but are derived from it at index time.
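
As an illustration of the shape such a dictionary might take, the following minimal sketch builds two derived fields and evaluates them against one document. The Document dataclass, the field names, and the derive_fields helper are hypothetical stand-ins introduced for this example only; they are not part of pybool_ir, and the real Document class may expose different attributes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class Document:
    """Hypothetical stand-in for pybool_ir.index.document.Document."""
    id: str
    title: str
    abstract: str


# Each entry maps a field name to a function that derives the value from a document.
optional_fields: Dict[str, Callable[[Document], Any]] = {
    "title_length": lambda doc: len(doc.title.split()),
    "has_abstract": lambda doc: bool(doc.abstract.strip()),
}


def derive_fields(doc: Document,
                  fields: Dict[str, Callable[[Document], Any]]) -> Dict[str, Any]:
    """Hypothetical helper: call every function on the document at 'index time'."""
    return {name: fn(doc) for name, fn in fields.items()}


doc = Document(id="1", title="Boolean retrieval for systematic reviews", abstract="")
derived = derive_fields(doc, optional_fields)
print(derived)
```

With the real library, a dictionary of this shape would presumably be passed as the optional_fields argument of add_document or bulk_index.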

bulk_index(fname: Path | str, optional_fields: Dict[str, Callable[[Document], Any]] | None = None)

Index a collection of documents from a file or directory.

index: Indexer

The underlying Lucene index.

abstract parse_documents(fname: Path) -> Tuple[Iterable[Document], int]

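Return an iterable of documents from a path, together with an integer (presumably the number of documents). A collection can be stored in different ways, for example as a single file or as a directory of files, so an indexer may need more than one parsing strategy. This method chooses the appropriate strategy for the given path.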

abstract process_document(doc: Document) -> Document

Get a document ready for indexing.

abstract set_index_fields(store_fields: bool = False)

Set the fields of the index. Off-the-shelf implementations for particular collections require specific Lucene fields to be set.
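
To show how the three abstract methods documented above could fit together, here is a rough sketch that does not use pybool_ir or Lucene at all. IndexerSketch is a hypothetical stand-in that only mirrors the documented method names; JsonlIndexer, the JSON-lines input format, the in-memory docs list, and the value returned by bulk_index are all assumptions made for illustration. The integer returned by parse_documents is treated as a document count, which the source does not confirm.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, Iterable, List, Tuple

Doc = Dict[str, str]  # hypothetical stand-in for a Document


class IndexerSketch(ABC):
    """Stand-in base class mirroring the documented abstract methods."""

    def __init__(self, store_fields: bool = True):
        self.docs: List[Doc] = []  # plays the role of the index
        self.set_index_fields(store_fields)

    def bulk_index(self, fname: Path) -> int:
        # Parse, then process and store each document in turn.
        docs, total = self.parse_documents(Path(fname))
        for doc in docs:
            self.docs.append(self.process_document(doc))
        return total  # returned here only to make the sketch easy to inspect

    @abstractmethod
    def parse_documents(self, fname: Path) -> Tuple[Iterable[Doc], int]: ...

    @abstractmethod
    def process_document(self, doc: Doc) -> Doc: ...

    @abstractmethod
    def set_index_fields(self, store_fields: bool = False) -> None: ...


class JsonlIndexer(IndexerSketch):
    """Hypothetical indexer for a file with one JSON document per line."""

    def set_index_fields(self, store_fields: bool = False) -> None:
        self.store_fields = store_fields

    def parse_documents(self, fname: Path) -> Tuple[Iterable[Doc], int]:
        lines = [ln for ln in fname.read_text().splitlines() if ln.strip()]
        return (json.loads(ln) for ln in lines), len(lines)

    def process_document(self, doc: Doc) -> Doc:
        # Normalise the title before it is stored.
        return {**doc, "title": doc["title"].strip()}


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.jsonl"
    path.write_text('{"id": "1", "title": " First "}\n{"id": "2", "title": "Second"}\n')
    indexer = JsonlIndexer()
    total = indexer.bulk_index(path)
```

Returning the count alongside a lazy iterable lets a caller report progress without reading every document into memory first, which is one plausible reason for the tuple return type.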

class pybool_ir.index.index.SearcherMixin

Bases: ABC

Include this mixin to add search functionality to an Indexer.

abstract search(query: str, n_hits=10) -> List[Document]

Given a query, return the top n_hits documents. When n_hits is None, return all documents that match the query.
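
The n_hits behaviour described above can be sketched without Lucene as follows. SearcherSketch and InMemorySearcher are hypothetical stand-ins, documents are plain strings, and a case-insensitive substring match in insertion order stands in for real query parsing and ranking; none of this reflects how pybool_ir implements search.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List, Optional


class SearcherSketch(ABC):
    """Stand-in mixin mirroring the documented search signature."""

    @abstractmethod
    def search(self, query: str, n_hits: Optional[int] = 10) -> List[str]: ...


class InMemorySearcher(SearcherSketch):
    """Hypothetical searcher over a list of strings."""

    def __init__(self, docs: Iterable[str]):
        self.docs = list(docs)

    def search(self, query: str, n_hits: Optional[int] = 10) -> List[str]:
        # Substring matching stands in for a real query language and ranking.
        matches = [d for d in self.docs if query.lower() in d.lower()]
        # None means "no limit": return every matching document.
        return matches if n_hits is None else matches[:n_hits]


searcher = InMemorySearcher(
    [f"cancer study {i}" for i in range(12)] + ["diabetes study"]
)
default_hits = searcher.search("cancer")            # capped at 10
all_hits = searcher.search("cancer", n_hits=None)   # every match
few_hits = searcher.search("cancer", n_hits=3)
```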

abstract search_fmt(query: str, n_hits=10, hit_formatter: str | None = None) -> None

Perform a search and print the results.