The IR community has always aimed to improve the realism of retrieval experiments by increasing the size of document collections. As collection sizes grow from megabytes to giga-, tera-, and perhaps soon petabytes, IR labs are challenged to keep pace. Herein, we describe our work on integrating ChatNoir with ir_datasets and PyTerrier to create chatnoir-pyterrier, a Python package for using ChatNoir in multi-stage retrieval pipelines. ChatNoir provides BM25-based first-stage retrieval on all ClueWeb crawls and all MS MARCO variants, with a collective index size of about 20 TB. This improves inclusivity by lowering the barrier to entry for web-scale IR and reduces redundant first-stage indexing overhead across IR labs. We show how chatnoir-pyterrier simplifies a wide range of re-ranking approaches and facilitates retrieval-augmented generation setups against large corpora.
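
To make the multi-stage setting concrete, the following sketch (not code from the paper) illustrates how a ChatNoir first stage might be combined with document text from ir_datasets and a neural re-ranker in PyTerrier. The ChatNoirRetrieve class is assumed to be the package's entry point, and the index name, the ir_datasets identifier, and the monoT5 re-ranker from pyterrier_t5 are illustrative choices rather than the paper's own configuration.

    import pyterrier as pt
    from chatnoir_pyterrier import ChatNoirRetrieve  # assumed entry point of chatnoir-pyterrier
    from pyterrier_t5 import MonoT5ReRanker          # illustrative cross-encoder re-ranker

    if not pt.started():
        pt.init()  # older PyTerrier versions require explicit initialisation

    # First stage: BM25 retrieval against ChatNoir's hosted index via its REST API,
    # so no local index has to be built. The index name below is an assumption; see
    # the ChatNoir documentation for the identifiers of the ClueWeb and MS MARCO indices.
    first_stage = ChatNoirRetrieve(index="msmarco-passage")

    # Attach document text from ir_datasets (assuming ChatNoir's docnos match the
    # ir_datasets document IDs), then re-rank the top 100 results with monoT5.
    dataset = pt.get_dataset("irds:msmarco-passage")
    pipeline = (first_stage % 100) >> pt.text.get_text(dataset, "text") >> MonoT5ReRanker()

    results = pipeline.search("multi-stage retrieval pipelines")
    print(results[["docno", "score"]].head())

The same pattern carries over to retrieval-augmented generation: the re-ranked results, with their text column attached, can be handed to a generation component as context instead of, or after, the cross-encoder stage.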