pybool_ir.index
Provides all the functionality for indexing and searching documents. It includes the Document class, which is used to represent documents in pybool_ir. It also includes generic and off-the-shelf indexing pipelines.
- class pybool_ir.index.Indexer(index_path: Path | str, store_fields: bool = True, optional_fields: List[str] | None = None)
Bases: ABC
Base class that provides the basic functionality for indexing and searching documents. On its own, this class offers no way to search documents other than by using the Lucene API directly.
- add_document(doc: Document, optional_fields: Dict[str, Callable[[Document], Any]] | None = None) -> None
Add a single document to the index. This method is called by bulk_index.
optional_fields is a dictionary that maps field names to functions, each of which takes a document and returns the value for that field. This is useful for fields that are not part of the document itself but are derived from it and calculated at index time.
- bulk_index(fname: Path | str, optional_fields: Dict[str, Callable[[Document], Any]] | None = None)
Index a collection of documents from a file or directory.
- index: Indexer
The underlying Lucene index.
- abstract parse_documents(fname: Path) -> Tuple[Iterable[Document], int]
Return an iterable of documents parsed from a path, along with an integer. Because collections can be stored in different ways, an indexer may support several file layouts. This method chooses how to parse the input based on the filename.
- abstract set_index_fields(store_fields: bool = False)
Set the fields of the index. Off-the-shelf implementations for particular collections require specific Lucene fields to be set.
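The relationship between parse_documents, bulk_index, add_document and optional_fields can be sketched without Lucene. The block below does not import pybool_ir. ToyDocument, ToyIndexer, JsonlToyIndexer and run_demo are hypothetical stand-ins that mirror only the method names and signatures documented above, and a plain list takes the place of the index. The integer returned by parse_documents is assumed here to be a document count, which the reference above does not state.

```python
import json
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path
from typing import Any, Callable, Dict, Iterable, List, Optional, Tuple, Union


@dataclass
class ToyDocument:
    """Hypothetical stand-in for pybool_ir's Document (not the real class)."""
    id: str
    title: str
    abstract: str


# Same shape as the documented optional_fields argument.
OptionalFields = Optional[Dict[str, Callable[[ToyDocument], Any]]]


class ToyIndexer(ABC):
    """Hypothetical stand-in mirroring the documented Indexer methods."""

    def __init__(self) -> None:
        # A plain list stands in for the underlying Lucene index.
        self.indexed: List[Dict[str, Any]] = []

    @abstractmethod
    def parse_documents(self, fname: Path) -> Tuple[Iterable[ToyDocument], int]:
        """Return the documents and (assumed) how many there are."""

    @abstractmethod
    def set_index_fields(self, store_fields: bool = False) -> None:
        """Configure index fields; a no-op placeholder in this sketch."""

    def add_document(self, doc: ToyDocument, optional_fields: OptionalFields = None) -> None:
        record: Dict[str, Any] = {"id": doc.id, "title": doc.title, "abstract": doc.abstract}
        # Each optional field is computed from the document at index time.
        for name, derive in (optional_fields or {}).items():
            record[name] = derive(doc)
        self.indexed.append(record)

    def bulk_index(self, fname: Union[Path, str], optional_fields: OptionalFields = None) -> None:
        docs, _count = self.parse_documents(Path(fname))
        for doc in docs:
            self.add_document(doc, optional_fields)


class JsonlToyIndexer(ToyIndexer):
    """Toy subclass that reads one JSON object per line."""

    def set_index_fields(self, store_fields: bool = False) -> None:
        self.store_fields = store_fields

    def parse_documents(self, fname: Path) -> Tuple[Iterable[ToyDocument], int]:
        lines = [ln for ln in fname.read_text(encoding="utf-8").splitlines() if ln.strip()]
        docs = (ToyDocument(**json.loads(ln)) for ln in lines)
        return docs, len(lines)


def run_demo() -> List[Dict[str, Any]]:
    # A derived field: the number of whitespace-separated tokens in the abstract.
    optional_fields = {"abstract_length": lambda d: len(d.abstract.split())}
    rows = [
        {"id": "1", "title": "A", "abstract": "one two three"},
        {"id": "2", "title": "B", "abstract": "four five"},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "docs.jsonl"
        path.write_text("".join(json.dumps(r) + "\n" for r in rows), encoding="utf-8")
        indexer = JsonlToyIndexer()
        indexer.bulk_index(path, optional_fields)
    return indexer.indexed


if __name__ == "__main__":
    for record in run_demo():
        print(record)
```

In this sketch the derived field lives only in the indexed record, not on the document, which matches the description of optional_fields as values calculated at index time. A real subclass would implement the same two abstract methods against Lucene rather than a list.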
Modules
- Implementation for how documents are represented in pybool_ir.
- Generic indexers and searchers for JSONL and JSONLD files.
- Base classes for indexing and searching documents.
- Off-the-shelf indexer for PubMed articles.