To Case or Not to Case: An Empirical Study in Learned Sparse Retrieval
Emmanouil Georgios Lionis, Jia-Huei Ju, Angelos Nalmpantis, Casper Thuis, Sean MacAvaney, Andrew Yates

TL;DR
This study evaluates the impact of cased versus uncased backbone models on Learned Sparse Retrieval, finding that cased models perform worse unless text is lowercased, which makes them behave like uncased models.
Contribution
It provides the first systematic comparison of cased and uncased backbone models in LSR and shows that lowercasing the input text brings cased models' performance in line with their uncased counterparts, which extends LSR to backbones that are available only in cased versions.
Findings
Cased models perform worse than uncased models in LSR.
Lowercasing the input text restores the performance of cased models to uncased levels.
With lowercased input, cased models effectively behave as uncased models.
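The vocabulary-mismatch effect behind these findings can be illustrated with a toy sketch. It uses a whitespace tokenizer as a stand-in for a real subword vocabulary, so it is not the paper's pipeline; the function names `tokenize` and `overlap` and the example strings are hypothetical.

```python
def tokenize(text: str, lowercase: bool) -> set[str]:
    """Toy whitespace tokenizer; a real LSR model would use a subword vocabulary."""
    if lowercase:
        text = text.lower()
    return set(text.split())


def overlap(query: str, doc: str, lowercase: bool) -> set[str]:
    """Terms shared by query and document under the chosen casing policy."""
    return tokenize(query, lowercase) & tokenize(doc, lowercase)


query = "Learned Sparse Retrieval"
doc = "an overview of learned sparse retrieval methods"

# Case-sensitive terms: "Learned" and "learned" are different vocabulary entries.
cased_overlap = overlap(query, doc, lowercase=False)

# Lowercasing first collapses the case variants onto the same term.
lowered_overlap = overlap(query, doc, lowercase=True)
```

In this toy setting the case-sensitive comparison shares no terms, while the lowercased one matches all three query terms, which is the kind of mismatch that lowercasing the input removes.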
Abstract
Learned Sparse Retrieval (LSR) methods construct sparse lexical representations of queries and documents that can be efficiently searched using inverted indexes. Existing LSR approaches have relied almost exclusively on uncased backbone models, whose vocabularies exclude case-sensitive distinctions, thereby reducing vocabulary mismatch. However, the most recent state-of-the-art language models are only available in cased versions. Despite this shift, the impact of backbone model casing on LSR has not been studied, potentially posing a risk to the viability of the method going forward. To fill this gap, we systematically evaluate paired cased and uncased versions of the same backbone models across multiple datasets to assess their suitability for LSR. Our findings show that LSR models with cased backbone models by default perform substantially worse than their uncased counterparts;…
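As a rough illustration of how sparse lexical representations are searched with an inverted index, here is a minimal sketch. The term weights are made up by hand rather than produced by a model, and `build_index` and `search` are hypothetical helper names, not part of any LSR library.

```python
from collections import defaultdict


def build_index(docs: dict[str, dict[str, float]]) -> dict[str, list[tuple[str, float]]]:
    """Map each term to a posting list of (doc_id, weight) pairs."""
    index: dict[str, list[tuple[str, float]]] = defaultdict(list)
    for doc_id, weights in docs.items():
        for term, weight in weights.items():
            if weight > 0:  # only non-zero weights are stored, hence "sparse"
                index[term].append((doc_id, weight))
    return dict(index)


def search(index: dict[str, list[tuple[str, float]]],
           query: dict[str, float]) -> list[tuple[str, float]]:
    """Score documents by the dot product of query and document term weights."""
    scores: dict[str, float] = defaultdict(float)
    for term, query_weight in query.items():
        # Only posting lists for terms present in the query are visited.
        for doc_id, doc_weight in index.get(term, []):
            scores[doc_id] += query_weight * doc_weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hand-written weights standing in for model outputs (assumption).
docs = {
    "d1": {"sparse": 1.5, "retrieval": 1.0},
    "d2": {"dense": 2.0, "retrieval": 0.5},
}
index = build_index(docs)
results = search(index, {"sparse": 1.0, "retrieval": 2.0})
```

Because scoring only touches terms that appear in the query, a query term that differs from a document term only by case contributes nothing to the score, which is why casing matters for this kind of retrieval.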
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
