Reproducible Hybrid Time-Travel Retrieval in Evolving Corpora
Moritz Staudinger, Florina Piroi, Andreas Rauber

TL;DR
This paper introduces a hybrid retrieval system that keeps ranked lists reproducible over evolving document collections by combining Lucene for fast retrieval with a column-store-based, versioned and time-stamped index.
Contribution
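The paper presents a hybrid retrieval approach that guarantees reproducible search results in changing collections. It addresses a limitation of typical ranked retrieval models: their global term statistics shift whenever documents are added or changed, so rankings are not reproducible even for the unchanged parts of the corpus.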
Findings
Achieves reproducible retrieval results over evolving corpora.
Hybrid system maintains original rankings despite collection changes.
Supports time-travel queries in web archives.
Abstract
There are settings in which reproducibility of ranked lists is desirable, such as when extracting a subset of an evolving document corpus for downstream research tasks or in domains such as patent retrieval or in medical systematic reviews, with high reproducibility expectations. However, as global term statistics change when documents change or are added to a corpus, queries using typical ranked retrieval models are not even reproducible for the parts of the document corpus that have not changed. Thus, Boolean retrieval frequently remains the mechanism of choice in such settings. We present a hybrid retrieval system combining Lucene for fast retrieval with a column-store-based retrieval system maintaining a versioned and time-stamped index. The latter component allows re-execution of previously posed queries resulting in the same ranked list and further allows for time-travel queries…
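The effect described in the abstract can be illustrated with a small in-memory sketch. It is not the authors' system (which pairs Lucene with a column store); it only assumes that every document version carries a validity interval and that term statistics are computed over the versions valid at the query's timestamp. The class and method names (`VersionedIndex`, `snapshot`, `search`) and the simple TF-IDF-style score are hypothetical choices made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class DocVersion:
    doc_id: str
    tokens: list
    valid_from: int
    valid_to: float = math.inf  # open-ended until replaced


class VersionedIndex:
    """Toy time-stamped index; illustrative only, not the paper's design."""

    def __init__(self):
        self.versions = []

    def add(self, doc_id, text, ts):
        # Close the currently live version of this document, if any.
        for v in self.versions:
            if v.doc_id == doc_id and v.valid_to == math.inf:
                v.valid_to = ts
        self.versions.append(DocVersion(doc_id, text.lower().split(), ts))

    def snapshot(self, ts):
        # All document versions that were valid at timestamp ts.
        return [v for v in self.versions if v.valid_from <= ts < v.valid_to]

    def search(self, query, ts):
        docs = self.snapshot(ts)
        n = len(docs)
        terms = query.lower().split()
        # Global statistics are computed over the snapshot, not the live corpus.
        df = {t: sum(1 for d in docs if t in d.tokens) for t in terms}
        scored = []
        for d in docs:
            score = sum(
                d.tokens.count(t) * math.log(1 + n / df[t])
                for t in terms
                if t in d.tokens
            )
            if score > 0:
                scored.append((d.doc_id, round(score, 6)))
        # Sort by descending score, then by id for a deterministic order.
        return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


index = VersionedIndex()
index.add("d1", "patent retrieval system", ts=1)
index.add("d2", "medical review retrieval", ts=2)

before = index.search("patent retrieval", ts=2)

# The corpus evolves: a new document changes the global term statistics.
index.add("d3", "patent patent search", ts=3)

replayed = index.search("patent retrieval", ts=2)  # time-travel query
current = index.search("patent retrieval", ts=3)   # query on the live corpus
```

In this sketch, `replayed` matches `before` because both are scored against the snapshot at timestamp 2, while `current` assigns a different score to the unchanged document `d1` because the document frequency of "patent" has changed.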
