Diagnosing LLM-based Rerankers in Cold-Start Recommender Systems: Coverage, Exposure and Practical Mitigations
Ekaterina Lemdiasova, Nikita Zmanovskii

TL;DR
This study systematically diagnoses the limitations of LLM-based rerankers in cold-start recommender systems, revealing issues in coverage, exposure bias, and score discrimination, and offers practical mitigation strategies.
Contribution
The paper identifies three failure modes of rerankers in cold-start recommendation (low retrieval coverage, exposure bias, and weak score discrimination) and proposes practical mitigations for deployment.
Findings
LLM rerankers suffer from low retrieval coverage in candidate generation (recall@200 = 0.109 vs. 0.609 for baselines).
Popularity-based ranking outperforms LLM reranking in accuracy.
Retrieval stage limitations are the main cause of performance gaps.
Abstract
Large language models (LLMs) and cross-encoder rerankers have gained attention for improving recommender systems, particularly in cold-start scenarios where user interaction history is limited. However, practical deployment reveals significant performance gaps between LLM-based approaches and simple baselines. This paper presents a systematic diagnostic study of cross-encoder rerankers in cold-start movie recommendation using the Serendipity-2018 dataset. Through controlled experiments with 500 users across multiple random seeds, we identify three critical failure modes: (1) low retrieval coverage in candidate generation (recall@200 = 0.109 vs. 0.609 for baselines), (2) severe exposure bias, with rerankers concentrating recommendations on 3 unique items versus 497 for a random baseline, and (3) minimal score discrimination between relevant and irrelevant items (mean difference = 0.098,…
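The three diagnostics named in the abstract (recall@K, number of unique items exposed, and the mean score difference between relevant and irrelevant items) can be illustrated with a minimal sketch. The abstract does not give the paper's exact definitions, so the formulas below are common ones assumed for illustration. The function names and the toy data are hypothetical and are not taken from the paper or from the Serendipity-2018 dataset.

```python
from statistics import mean


def recall_at_k(candidates, relevant, k):
    """Fraction of a user's relevant items found in the top-k candidates.

    Assumed definition: |top-k ∩ relevant| / |relevant|.
    """
    if not relevant:
        return 0.0
    top_k = set(candidates[:k])
    return len(top_k & set(relevant)) / len(relevant)


def unique_items_exposed(recommendations):
    """Count distinct items across all users' recommendation lists.

    `recommendations` maps a user id to that user's list of recommended items.
    A small count means recommendations are concentrated on a few items.
    """
    return len({item for rec_list in recommendations.values() for item in rec_list})


def score_gap(scores, relevant):
    """Mean reranker score of relevant items minus that of irrelevant items.

    `scores` maps an item id to its reranker score. Returns NaN when either
    group is empty, since the gap is undefined in that case.
    """
    rel = [s for item, s in scores.items() if item in relevant]
    irr = [s for item, s in scores.items() if item not in relevant]
    if not rel or not irr:
        return float("nan")
    return mean(rel) - mean(irr)


# Toy data (hypothetical), chosen only to exercise the functions.
toy_candidates = ["a", "b", "c", "d"]
toy_relevant = {"b", "z"}
toy_recs = {"u1": ["a", "b"], "u2": ["a", "b"], "u3": ["a", "c"]}
toy_scores = {"a": 0.5, "b": 0.75, "c": 0.25}

toy_recall = recall_at_k(toy_candidates, toy_relevant, k=3)  # 1 of 2 relevant items retrieved
toy_exposure = unique_items_exposed(toy_recs)                # items a, b, c
toy_gap = score_gap(toy_scores, {"b"})                       # 0.75 - mean(0.5, 0.25)
```

In a study like the one described, recall@K would typically be averaged over users and seeds, and a near-zero score gap would indicate that the reranker barely separates relevant from irrelevant candidates.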
