Web-Scale Retrieval Experimentation with chatnoir-pyterrier

Publication
Proceedings of the 47th European Conference on Information Retrieval

The IR community has always aimed to improve the realism of retrieval experiments by increasing the size of the document collections. As collection sizes grow from megabytes to gigabytes, terabytes, and perhaps soon petabytes, IR labs are challenged to keep pace. Herein, we describe our work on integrating ChatNoir with ir_datasets and PyTerrier to create chatnoir-pyterrier, a Python package for using ChatNoir in multi-stage retrieval pipelines. ChatNoir provides BM25-based first-stage retrieval on all ClueWeb crawls and all MS MARCO variants, with a collective index size of about 20 TB. Access to these existing indexes improves inclusivity by lowering the barrier to entry for web-scale IR and reduces redundant first-stage indexing overhead across IR labs. We show how chatnoir-pyterrier simplifies a wide range of re-ranking approaches and facilitates retrieval-augmented generation setups against large corpora.