pybool_ir.index.pubmed

Off-the-shelf indexer for PubMed articles.

Functions

parse_medline_date(date_str)

Parse a date string from a Medline record.

parse_pubmed_article_node(element)

Parse a PubmedArticle node from a PubMed XML element.

Classes

PubmedArticle(id, date, title, abstract, ...)

This is a special override of the Document class for PubMed articles.

PubmedIndexer(index_path[, store_fields, ...])

Off-the-shelf indexer for PubMed XML files.

class pybool_ir.index.pubmed.PubmedArticle(id: str, date: datetime, title: str, abstract: str, publication_type: List[str], mesh_heading_list: List[str], mesh_qualifier_list: List[str], mesh_major_heading_list: List[str], supplementary_concept_list: List[str], keyword_list: List[str], **optional_fields)

Bases: Document

This is a special override of the Document class for PubMed articles. The constructor takes in all the fields that are required for PubMed articles.

static from_hit(hit: Hit)

Create a PubmedArticle from a Lucene Hit. This method also removes the __id__ and __score__ fields from the hit, so a document retrieved from a hit using this method should be equivalent to the same document prior to indexing.
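
The round-trip property described above can be illustrated with a minimal sketch. Plain dicts stand in for a Lucene hit and a document here, and strip_hit_metadata is a hypothetical helper, not part of pybool_ir:

```python
# Hypothetical sketch: dicts stand in for a Lucene hit and a document.
SEARCH_ONLY_KEYS = ("__id__", "__score__")


def strip_hit_metadata(hit_fields: dict) -> dict:
    """Drop the search-only keys so that only document fields remain."""
    return {k: v for k, v in hit_fields.items() if k not in SEARCH_ONLY_KEYS}


# A made-up document, and the same fields as they might come back from a search.
original = {"id": "12345", "title": "An example title"}
hit = {**original, "__id__": 7, "__score__": 1.5}

roundtrip = strip_hit_metadata(hit)
```

After the two metadata keys are removed, roundtrip compares equal to original, which is the equivalence the method is documented to preserve.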

class pybool_ir.index.pubmed.PubmedIndexer(index_path: Path | str, store_fields: bool = True, optional_fields: List[str] | None = None)

Bases: Indexer, SearcherMixin

Off-the-shelf indexer for PubMed XML files.

>>> from pybool_ir.index.pubmed import PubmedIndexer
>>>
>>> with PubmedIndexer("path/to/index", store_fields=True) as idx:
...     idx.bulk_index("path/to/baseline")

parse_documents(baseline_path: Path) -> (Iterable[Document], int)

Return an iterable of documents from a path, together with an integer (see the signature). Because documents can be stored in different ways, an indexer may support several ways to read them. This method chooses the best way to parse a file given the filename.
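
As a rough illustration of choosing a reader from the filename, the sketch below dispatches on the path alone. The rules used here (directory, then a .jsonl suffix, then everything else) are an assumption rather than the library's actual logic, and choose_reader is a hypothetical helper that only returns the name of the reader it would pick:

```python
from pathlib import Path


def choose_reader(path: Path) -> str:
    """Return the name of the reader that would handle this path.

    Assumed rules, not taken from pybool_ir: directories go to read_folder,
    .jsonl files go to read_jsonl, and anything else goes to read_file.
    """
    if path.is_dir():
        return "read_folder"
    if path.suffix == ".jsonl":
        return "read_jsonl"
    return "read_file"


# The paths below are placeholders; only "." is guaranteed to exist.
for_folder = choose_reader(Path("."))
for_jsonl = choose_reader(Path("baseline.jsonl"))
for_gzip = choose_reader(Path("baseline.xml.gz"))
```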

process_document(doc: Document) -> Document

Get a document ready for indexing.

static read_file(fname: Path) -> Iterable[Document]

Read a single file, yielding documents. Supports both XML and GZipped XML files. This is how PubMed documents are stored on the baseline FTP server.
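
One common way to support both plain and GZipped XML is to pick the opener from the file suffix. The following is a minimal sketch under that assumption; open_xml is a hypothetical helper and not necessarily how read_file is implemented:

```python
import gzip
import tempfile
from pathlib import Path
from typing import BinaryIO, cast


def open_xml(fname: Path) -> BinaryIO:
    """Open fname for binary reading, decompressing when it ends in .gz."""
    if fname.suffix == ".gz":
        return cast(BinaryIO, gzip.open(fname, "rb"))
    return open(fname, "rb")


# Write the same made-up payload as plain and as gzipped XML, then read both back.
payload = b"<PubmedArticleSet></PubmedArticleSet>"
with tempfile.TemporaryDirectory() as tmp:
    plain = Path(tmp) / "sample.xml"
    packed = Path(tmp) / "sample.xml.gz"
    plain.write_bytes(payload)
    with gzip.open(packed, "wb") as f:
        f.write(payload)

    with open_xml(plain) as f:
        from_plain = f.read()
    with open_xml(packed) as f:
        from_packed = f.read()
```

Both reads return the same bytes, so downstream XML parsing does not need to know which form was on disk.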

static read_folder(folder: Path) -> Iterable[Document]

Read a folder of XML files. This method should be used when the PubMed documents are stored in a folder.

static read_jsonl(file: Path) -> Iterable[Document]

Read a JSONL file. This method should be used when the PubMed documents are stored in a JSONL file. The pybool_ir command line tool can be used to convert PubMed XML files to JSONL files. Conversion of the files makes indexing considerably faster since the XML files do not need to be parsed.
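
To show why JSONL is cheap to read, here is a minimal sketch that yields one dict per line using only the standard library. The field names in the sample file are made up for illustration; the exact records written by the pybool_ir command line tool are not specified here, and iter_jsonl is a hypothetical helper:

```python
import json
import tempfile
from pathlib import Path
from typing import Iterator


def iter_jsonl(file: Path) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line."""
    with open(file, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


# Write a tiny made-up JSONL file and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.jsonl"
    path.write_text(
        '{"id": "1", "title": "First"}\n{"id": "2", "title": "Second"}\n',
        encoding="utf-8",
    )
    records = list(iter_jsonl(path))
```

Each line is decoded independently, so no XML tree has to be built before a document becomes available.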

search(query: str, n_hits=10) -> List[Document]

Given a query, return the top n_hits documents. When n_hits is None, return all documents that match the query.

search_fmt(query: str, n_hits=10, hit_formatter: str | None = None)

Perform a search and print the results.

set_index_fields(store_fields: bool = False, optional_fields: List[str] | None = None)

Set fields of the index. Off-the-shelf implementations of indexing particular collections require specific fields in Lucene to be set.

pybool_ir.index.pubmed.parse_medline_date(date_str: str) -> Tuple[int, int, int]

Parse a date string from a Medline record. The returned value is a tuple of (year, month, day).

The following is a needlessly complicated, yet accurate implementation of how PubMed handles the publication date of documents. For more information about the nuances of this technical marvel, see: https://pubmed.ncbi.nlm.nih.gov/help/#dp
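
To give a feel for the problem, the sketch below is a deliberately simplified stand-in, not the library's implementation. It takes the first four-digit year, then the first three-letter English month abbreviation and day that follow it. Defaulting a missing month or day to 1 is an assumption made for this sketch, and first_date_parts is a hypothetical name:

```python
import re
from typing import Tuple

MONTHS = {name: number for number, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"], start=1)}


def first_date_parts(date_str: str) -> Tuple[int, int, int]:
    """Return (year, month, day) for the start of a free-text date string."""
    year_match = re.search(r"\b(\d{4})\b", date_str)
    if year_match is None:
        raise ValueError(f"no four-digit year in {date_str!r}")
    # Assumption of this sketch: a missing month or day falls back to 1.
    year, month, day = int(year_match.group(1)), 1, 1

    # Only look for a month (and then a day) after the first year.
    rest = date_str[year_match.end():]
    month_match = re.search(r"\b([A-Za-z]{3})", rest)
    if month_match and month_match.group(1).capitalize() in MONTHS:
        month = MONTHS[month_match.group(1).capitalize()]
        day_match = re.match(r"\s+(\d{1,2})\b", rest[month_match.end():])
        if day_match:
            day = int(day_match.group(1))
    return year, month, day


print(first_date_parts("1998 Dec-1999 Jan"))    # (1998, 12, 1)
print(first_date_parts("2000 Dec 23- 2001 Jan 5"))  # (2000, 12, 23)
```

Strings without a recognisable month, such as a bare year or a season name, fall back to January 1 in this sketch; the real function follows the rules in the PubMed help page linked above.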

pybool_ir.index.pubmed.parse_pubmed_article_node(element: Element) -> PubmedArticle

Parse a PubmedArticle node from a PubMed XML element.
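
For orientation, the sketch below pulls a few fields out of a PubmedArticle element with the standard library's ElementTree. It returns a plain dict rather than a PubmedArticle, covers only a handful of fields, and uses a made-up sample record; extract_fields is a hypothetical helper and not the function documented above:

```python
import xml.etree.ElementTree as ET

# Made-up record that follows the PubmedArticle element layout.
SAMPLE = """<PubmedArticle>
  <MedlineCitation>
    <PMID>12345</PMID>
    <Article>
      <ArticleTitle>An example title</ArticleTitle>
      <Abstract><AbstractText>An example abstract.</AbstractText></Abstract>
    </Article>
    <MeshHeadingList>
      <MeshHeading><DescriptorName MajorTopicYN="N">Humans</DescriptorName></MeshHeading>
      <MeshHeading><DescriptorName MajorTopicYN="Y">Neoplasms</DescriptorName></MeshHeading>
    </MeshHeadingList>
  </MedlineCitation>
</PubmedArticle>"""


def extract_fields(element: ET.Element) -> dict:
    """Collect the id, title, abstract, and MeSH descriptors of one article."""
    descriptors = list(element.iterfind(
        "MedlineCitation/MeshHeadingList/MeshHeading/DescriptorName"))
    abstract_parts = element.iterfind(
        "MedlineCitation/Article/Abstract/AbstractText")
    return {
        "id": element.findtext("MedlineCitation/PMID", default=""),
        "title": element.findtext(
            "MedlineCitation/Article/ArticleTitle", default=""),
        # Structured abstracts have several AbstractText nodes; join them.
        "abstract": " ".join("".join(p.itertext()) for p in abstract_parts),
        "mesh_heading_list": [d.text for d in descriptors],
        "mesh_major_heading_list": [
            d.text for d in descriptors if d.get("MajorTopicYN") == "Y"],
    }


fields = extract_fields(ET.fromstring(SAMPLE))
```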