pybool_ir.index.ir_datasets

Classes

IRDatasetsIndexer(index_path, dataset_name)

class pybool_ir.index.ir_datasets.IRDatasetsIndexer(index_path: Path | str, dataset_name: str, store_fields: bool = True, optional_fields: List[str] | None = None)

Bases: Indexer

bulk_index()

Index a collection of documents from a file or directory.

parse_documents() -> Tuple[Iterable[pybool_ir.index.document.Document], int]

Return an iterable of documents from a path. Because documents can be stored in different ways, an indexer may support several ways of parsing files; this method chooses the most suitable one for a file based on its filename.

process_document(doc: Document) -> Document

Get a document ready for indexing.
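The method names above suggest a parse, process, index flow. The following is a minimal sketch of that pattern only; it does not use pybool_ir or Lucene. `ToyDocument` and `ToyIndexer` are hypothetical stand-ins, the integer returned by `parse_documents` is assumed here to be a document count, and the whitespace stripping in `process_document` is an invented example of "getting a document ready".

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass
class ToyDocument:
    """Hypothetical stand-in for pybool_ir.index.document.Document."""
    id: str
    fields: Dict[str, str] = field(default_factory=dict)


class ToyIndexer:
    """Hypothetical stand-in that mirrors the documented method names only."""

    def __init__(self, raw_docs: List[Dict[str, str]]):
        self.raw_docs = raw_docs
        self.indexed: List[ToyDocument] = []  # a plain list stands in for the index
        self.expected_total = 0

    def parse_documents(self) -> Tuple[Iterable[ToyDocument], int]:
        # Yield documents lazily, alongside an int (assumed: the document count).
        def _iter() -> Iterable[ToyDocument]:
            for raw in self.raw_docs:
                body = {k: v for k, v in raw.items() if k != "doc_id"}
                yield ToyDocument(id=raw["doc_id"], fields=body)

        return _iter(), len(self.raw_docs)

    def process_document(self, doc: ToyDocument) -> ToyDocument:
        # Invented example of preparation: trim whitespace from every field.
        doc.fields = {k: v.strip() for k, v in doc.fields.items()}
        return doc

    def bulk_index(self) -> None:
        # Parse once, then process and "index" each document in order.
        docs, self.expected_total = self.parse_documents()
        for doc in docs:
            self.indexed.append(self.process_document(doc))


indexer = ToyIndexer([
    {"doc_id": "1", "title": "  Boolean retrieval  "},
    {"doc_id": "2", "title": "Systematic reviews"},
])
indexer.bulk_index()
```

Returning the count next to a lazy iterable lets a caller report progress without materialising the whole collection first, which is one plausible reason for the tuple-shaped return annotation.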

set_index_fields(store_fields: bool = False)

Set the fields of the index. Off-the-shelf indexers for particular collections require specific Lucene fields to be set.