Semi-Parametric Retrieval via Binary Bag-of-Tokens Index
Jiawei Zhou, Li Dong, Furu Wei, Lei Chen

TL;DR
This paper introduces SiDR, a bi-encoder retrieval framework that decouples the retrieval index from neural parameters. It supports both embedding-based and tokenization-based indexes, which keeps indexing efficient and low-cost, and it outperforms neural and term-based baselines under the same indexing workload.
Contribution
The paper presents SiDR, a retrieval framework that combines a parametric (embedding) index with a non-parametric (tokenization) index, and evaluates its efficiency and effectiveness across 16 retrieval benchmarks.
Findings
SiDR outperforms neural retrievers with embedding indexes.
Tokenization-based SiDR reduces indexing cost to the level of traditional term-based methods.
The late parametric mechanism matches BM25 indexing time while improving effectiveness.
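The late parametric finding can be pictured as a two-stage search: a cheap first pass over the token index selects candidates, and only those candidates are embedded at query time and re-ranked. The following is a minimal sketch under stated assumptions, not the authors' implementation: `toy_embed`, `late_parametric_search`, the whitespace tokenizer, and the parameters `first_stage_k` and `top_k` are all hypothetical, and the term-count "encoder" merely stands in for a neural document encoder.

```python
def toy_embed(text, vocab):
    # Hypothetical stand-in for a neural encoder: a term-count vector over `vocab`.
    tokens = text.lower().split()
    return [tokens.count(term) for term in vocab]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def late_parametric_search(docs, query, vocab, first_stage_k=2, top_k=1):
    query_tokens = set(query.lower().split())

    # Stage 1 (non-parametric): rank documents by token overlap with the query.
    # No document embeddings are computed or stored at indexing time.
    overlap = [(len(query_tokens & set(d.lower().split())), i)
               for i, d in enumerate(docs)]
    ranked = sorted(overlap, key=lambda pair: (-pair[0], pair[1]))
    candidates = [i for score, i in ranked[:first_stage_k] if score > 0]

    # Stage 2 (parametric): embed only the candidates on the fly and re-rank them.
    query_vec = toy_embed(query, vocab)
    reranked = sorted(
        candidates,
        key=lambda i: (-dot(query_vec, toy_embed(docs[i], vocab)), i),
    )
    return reranked[:top_k]


docs = ["index index cost", "index freshness", "video talk"]
vocab = ["index", "cost", "freshness"]
best = late_parametric_search(docs, "index cost", vocab)
```

In this sketch the encoder runs on at most `first_stage_k` documents per query, which is why the indexing step itself needs no neural inference.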
Abstract
Information retrieval has transitioned from standalone systems into essential components across broader applications, with indexing efficiency, cost-effectiveness, and freshness becoming increasingly critical yet often overlooked. In this paper, we introduce SemI-parametric Disentangled Retrieval (SiDR), a bi-encoder retrieval framework that decouples retrieval index from neural parameters to enable efficient, low-cost, and parameter-agnostic indexing for emerging use cases. Specifically, in addition to using embeddings as indexes like existing neural retrieval methods, SiDR supports a non-parametric tokenization index for search, achieving BM25-like indexing complexity with significantly better effectiveness. Our comprehensive evaluation across 16 retrieval benchmarks demonstrates that SiDR outperforms both neural and term-based retrieval baselines under the same indexing workload: (i)…
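To make the tokenization index concrete, here is a minimal sketch of a binary bag-of-tokens index searched with a weighted query. It rests on assumptions not stated in the abstract: the whitespace tokenizer, the function names `build_binary_index` and `search`, and the hand-written `query_weights` are hypothetical. In a SiDR-like setup the query weights would presumably come from a neural query encoder, while the document side needs only tokenization.

```python
from collections import defaultdict


def tokenize(text):
    # Hypothetical whitespace tokenizer; a real system would use a subword tokenizer.
    return text.lower().split()


def build_binary_index(docs):
    """Map each token to the set of document ids that contain it (presence only)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in set(tokenize(text)):
            index[token].add(doc_id)
    return index


def search(index, query_weights, top_k=3):
    """Inner product of a weighted query vector with binary document vectors.

    For a binary document vector this reduces to summing the query weights
    of the tokens that occur in the document.
    """
    scores = defaultdict(float)
    for token, weight in query_weights.items():
        for doc_id in index.get(token, ()):
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


docs = [
    "neural retrieval with dense embeddings",
    "bm25 is a term based retrieval baseline",
    "tokenization builds an index without a neural encoder",
]
index = build_binary_index(docs)

# Hypothetical query-side weights, written by hand for illustration.
query_weights = {"index": 1.5, "tokenization": 1.0, "retrieval": 0.4}
results = search(index, query_weights)
```

Building the index here requires one tokenization pass over the collection and no model inference, which is the sense in which the indexing cost resembles that of a term-based method.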
Peer Reviews
Decision·ICLR 2025 Poster
The main strength of the paper lies in the static representation of documents. As far as I know, this is the first model that relies on a simple index structure, that of representing the documents as a binary bag of tokens. This allows for potentially fast retrieval (although this can be debated, see weaknesses) of potential candidates that then have to be re-ranked.
One of the main weaknesses is that the number of collisions (i.e. the number of documents per token) increases with the size of the collection. It is unclear how this approach performs when the number of documents, their length, or both increase. It would thus be important to evaluate the model on a larger collection (e.g. MS MARCO). Another point is that the authors state (l. 329) that SOA training "techniques are orthogonal to the retrieval model and have not been applied in o
The paper is well motivated. The problems identified with constructing and refreshing vector based document indices are consistent with prevalent challenges in the industry. While most SOTA retrieval systems, especially those used in the industry, are hybrid, re-using representations across the two forms of retrieval seems novel. There is a good set of baselines, though the most relevant baselines are those that perform hybrid retrieval with vector based and token based approaches. The paper wil
One of the strengths of bag-of-tokens-based retrieval is that it is easily interpretable. The paper misses an opportunity to demonstrate how going from a parametric representation (which is considered semantic retrieval) allows us to tackle the standard problems in IR such as polysemy and synonymy. How does the lack of weights (a la BM25) on the document side affect retrieval? A nit: since the paper uses a hybrid retrieval system, comparison against either vector retrieval alone is not an apples
See the summary for more details.
* An interesting approach
* Semi-parametric index retrieval is easy to carry out on a GPU
* A substantial evaluation using the BEIR datasets (plus additional QA datasets)
* Promising results
After discussion with the authors, I have become convinced that the paper is generally solid, but the presentation can be improved. There are several places where the paper is hard to read:
1. No explanation for the need of VDR (I still didn't quite get your explanations in the rebuttal).
2. The whole section on revisiting MLM is very confusing. You say "We provide insights into the consistencies between semi-parametric alignment and masked language model pre-training", but this is already a
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Data Management and Algorithms · Image Retrieval and Classification Techniques
