AlpsBench: An LLM Personalization Benchmark for Real-Dialogue Memorization and Preference Alignment
Jianfei Xiao, Xiang Yu, Chengbing Wang, Wuqiang Zheng, Xinyu Lin, Kaining Liu, Hongxun Ding, Yang Zhang, Wenjie Wang, Fuli Feng, Xiangnan He

TL;DR
AlpsBench is a new benchmark derived from real-world dialogues to evaluate LLM personalization, focusing on memory management tasks like extraction, updating, retrieval, and utilization.
Contribution
It introduces a realistic, structured benchmark for LLM personalization based on real dialogues, addressing limitations of synthetic data and evaluating the entire memory lifecycle.
Findings
Models struggle to extract latent user traits reliably.
Memory updating performance plateaus even in strong models.
Retrieval accuracy drops with large distractor pools.
Abstract
As Large Language Models (LLMs) evolve into lifelong AI assistants, LLM personalization has become a critical frontier. However, progress is currently bottlenecked by the absence of a gold-standard evaluation benchmark. Existing benchmarks either overlook personalized information management that is critical for personalization or rely heavily on synthetic dialogues, which exhibit an inherent distribution gap from real-world dialogue. To bridge this gap, we introduce AlpsBench, An LLM PerSonalization benchmark derived from real-world human-LLM dialogues. AlpsBench comprises 2,500 long-term interaction sequences curated from WildChat, paired with human-verified structured memories that encapsulate both explicit and implicit personalization signals. We define four pivotal tasks - personalized information extraction, updating, retrieval, and utilization - and establish protocols to evaluate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
