Benchmark Leakage Trap: Can We Trust LLM-based Recommendation?

Mingqiao Zhang; Qiyao Peng; Yumeng Wang; Chunyuan Liu; Hongtao Liu

arXiv:2602.13626 · cs.LG · March 10, 2026

TL;DR

This paper uncovers and analyzes the impact of benchmark data leakage in LLM-based recommender systems, demonstrating how it can artificially inflate or degrade performance metrics depending on domain relevance, thus questioning the reliability of current evaluation methods.

Contribution

It identifies and experimentally validates the issue of benchmark data leakage in LLM-based recommendation, highlighting its effects on performance measurement and evaluation reliability.

Findings

01. Leaked domain-relevant data causes inflated performance metrics.

02. Leaked domain-irrelevant data can reduce recommendation accuracy.

03. Data leakage significantly affects the trustworthiness of LLM-based recommendation evaluations.

Abstract

The expanding integration of Large Language Models (LLMs) into recommender systems poses critical challenges to evaluation reliability. This paper identifies and investigates a previously overlooked issue: benchmark data leakage in LLM-based recommendation. This phenomenon occurs when LLMs are exposed to and potentially memorize benchmark datasets during pre-training or fine-tuning, leading to artificially inflated performance metrics that fail to reflect true model performance. To validate this phenomenon, we simulate diverse data leakage scenarios by conducting continued pre-training of foundation models on strategically blended corpora, which include user-item interactions from both in-domain and out-of-domain sources. Our experiments reveal a dual-effect of data leakage: when the leaked data is domain-relevant, it induces substantial but spurious performance gains, misleadingly…
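
The leakage simulation described in the abstract (continued pre-training on a corpus blended with user-item interactions) can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the function names, the text template used to serialize interactions, and the `leak_ratio` parameter are hypothetical, and the paper's actual blending procedure is not given in this excerpt.

```python
import random


def serialize_interaction(user_id, item_titles):
    """Render one user's interaction history as a plain-text training line.

    The template is a hypothetical choice, not the paper's format.
    """
    return f"User {user_id} interacted with: " + ", ".join(item_titles)


def blend_corpus(general_docs, leaked_docs, leak_ratio, size, seed=0):
    """Sample `size` documents, with round(size * leak_ratio) drawn from
    `leaked_docs` (benchmark interactions) and the rest from `general_docs`.

    Sampling is with replacement; `leak_ratio` is an assumed knob for
    controlling how much benchmark data the model is exposed to.
    """
    if not 0.0 <= leak_ratio <= 1.0:
        raise ValueError("leak_ratio must be in [0, 1]")
    rng = random.Random(seed)
    n_leak = round(size * leak_ratio)
    blended = [rng.choice(leaked_docs) for _ in range(n_leak)]
    blended += [rng.choice(general_docs) for _ in range(size - n_leak)]
    rng.shuffle(blended)  # avoid all leaked lines appearing in one run
    return blended


# Toy data standing in for a general corpus and a leaked benchmark.
general = ["An article about cooking.", "A note on astronomy."]
leaked = [serialize_interaction(7, ["Movie A", "Movie B"])]
corpus = blend_corpus(general, leaked, leak_ratio=0.25, size=8)
```

Swapping `leaked` between in-domain and out-of-domain interaction data, while holding `leak_ratio` fixed, is one way to set up the kind of comparison the abstract describes; the resulting corpus would then feed whatever continued pre-training pipeline is in use.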

Taxonomy

Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Recommender Systems and Techniques