pybool_ir.index

Provides all the functionality for indexing and searching documents. It includes the Document class, which is used to represent documents in pybool_ir. It also includes generic and off-the-shelf indexing pipelines.

class pybool_ir.index.Indexer(index_path: Path | str, store_fields: bool = True, optional_fields: List[str] | None = None)

Bases: ABC

Base class that provides the basic functionality for indexing and searching documents. By default, this class offers no way to search documents other than by using the Lucene API directly.

add_document(doc: Document, optional_fields: Dict[str, Callable[[Document], Any]] | None = None) -> None

Add a single document to the index. This method is called by bulk_index.

optional_fields is a dictionary that maps field names to functions, each of which takes a document and returns the value for that field. This is useful for adding fields that are not stored in the document itself but are derived from it at index time.
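As a rough illustration of the mapping described above, the sketch below builds an optional_fields dictionary and evaluates it against one document. SimpleDoc is a hypothetical stand-in for the Document class (the real class may expose different attributes), and derive_fields is a hypothetical helper that is not part of pybool_ir; it only shows what "derived at index time" could look like.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for pybool_ir's Document; real attributes may differ."""
    id: str
    title: str
    abstract: str


# Field name -> function that derives the field's value from a document.
optional_fields: Dict[str, Callable[[SimpleDoc], Any]] = {
    "title_length": lambda doc: len(doc.title.split()),
    "has_abstract": lambda doc: bool(doc.abstract.strip()),
}


def derive_fields(doc: SimpleDoc, fields: Dict[str, Callable[[SimpleDoc], Any]]) -> Dict[str, Any]:
    """Hypothetical helper: evaluate each callable against one document."""
    return {name: fn(doc) for name, fn in fields.items()}


doc = SimpleDoc(id="1", title="Boolean queries for systematic reviews", abstract="")
derived = derive_fields(doc, optional_fields)
```

With a concrete indexer, the same dictionary would presumably be passed through the optional_fields argument shown in the signature, for example add_document(doc, optional_fields=optional_fields).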

bulk_index(fname: Path | str, optional_fields: Dict[str, Callable[[Document], Any]] | None = None)

Index a collection of documents from a file or directory.

index: Indexer

The underlying Lucene index.

abstract parse_documents(fname: Path) -> (Iterable[Document], int)

Return an iterable of documents parsed from a path. Because collections can be stored in different ways, an indexer may need to support several file layouts. This method chooses the best way to parse a file given its filename.
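A minimal sketch of what choosing a parser from the filename might look like, using JSONL as the only supported layout. The names parse_jsonl and choose_parser are hypothetical, plain dictionaries stand in for Document objects, and the integer is assumed to be a document count (the return type includes an int, but its meaning is not documented here).

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, Iterable, Iterator, Tuple


def parse_jsonl(fname: Path) -> Tuple[Iterable[Dict], int]:
    """Hypothetical parser: return a lazy iterable of records and a count.

    Assumption: the integer in the return type is the number of documents.
    """
    # First pass: count non-empty lines so callers could report progress.
    with fname.open(encoding="utf-8") as f:
        total = sum(1 for line in f if line.strip())

    def records() -> Iterator[Dict]:
        # Second pass: decode one JSON object per line, lazily.
        with fname.open(encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

    return records(), total


def choose_parser(fname: Path) -> Callable[[Path], Tuple[Iterable[Dict], int]]:
    """Pick a parser from the file extension (only JSONL is sketched here)."""
    if fname.suffix == ".jsonl":
        return parse_jsonl
    raise ValueError(f"Unsupported file type: {fname.suffix}")


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.jsonl"
    path.write_text(
        '{"id": "1", "title": "a"}\n{"id": "2", "title": "b"}\n',
        encoding="utf-8",
    )
    docs, count = choose_parser(path)(path)
    ids = [d["id"] for d in docs]  # consume while the file still exists
```

Returning a generator alongside the count keeps memory use flat for large collections while still letting a caller size a progress bar up front.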

abstract process_document(doc: Document) -> Document

Get a document ready for indexing.

abstract set_index_fields(store_fields: bool = False)

Set the fields of the index. Off-the-shelf indexers for particular collections require specific Lucene fields to be set.
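To show how the three abstract methods might fit together, here is a self-contained sketch that mirrors their names on a stand-in base class. SketchIndexer and TitleIndexer are hypothetical, documents are plain dictionaries, storage is an in-memory list rather than Lucene, and the call order inside bulk_index (parse, then process, then add) is an assumption rather than the library's confirmed behaviour.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List, Tuple


class SketchIndexer(ABC):
    """Hypothetical stand-in for Indexer; stores documents in a list, not Lucene."""

    def __init__(self, store_fields: bool = True):
        self.stored: List[Dict] = []
        self.fields: Dict[str, bool] = {}
        self.set_index_fields(store_fields=store_fields)

    def bulk_index(self, source: Iterable[Dict]) -> None:
        # Assumed call order: parse, then process, then add each document.
        docs, _total = self.parse_documents(source)
        for doc in docs:
            self.add_document(self.process_document(doc))

    def add_document(self, doc: Dict) -> None:
        # Keep only the fields declared by set_index_fields.
        self.stored.append({k: v for k, v in doc.items() if k in self.fields})

    @abstractmethod
    def parse_documents(self, source: Iterable[Dict]) -> Tuple[Iterable[Dict], int]: ...

    @abstractmethod
    def process_document(self, doc: Dict) -> Dict: ...

    @abstractmethod
    def set_index_fields(self, store_fields: bool = False) -> None: ...


class TitleIndexer(SketchIndexer):
    """Hypothetical concrete indexer that keeps an id and a normalised title."""

    def parse_documents(self, source):
        docs = list(source)
        return docs, len(docs)

    def process_document(self, doc):
        # Normalise the title before it is added to the index.
        return {**doc, "title": doc["title"].strip().lower()}

    def set_index_fields(self, store_fields: bool = False):
        self.fields = {"id": store_fields, "title": store_fields}


indexer = TitleIndexer()
indexer.bulk_index([{"id": "1", "title": "  Heart Failure ", "extra": "x"}])
```

Because the base class is abstract, a subclass that omits any of the three methods cannot be instantiated, which is the main reason to declare them with abstractmethod.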

Modules

pybool_ir.index.document

Implementation for how documents are represented in pybool_ir.

pybool_ir.index.generic

Generic indexers and searchers for JSONL and JSONLD files.

pybool_ir.index.index

Base classes for indexing and searching documents.

pybool_ir.index.ir_datasets

pybool_ir.index.pubmed

Off-the-shelf indexer for PubMed articles.