CUPID: Evaluating Personalized and Contextualized Alignment of LLMs from Interactions

Tae Soo Kim; Yoonjoo Lee; Yoonah Park; Jiho Kim; Young-Ho Kim; Juho Kim

arXiv:2508.01674 · cs.CL · August 8, 2025

TL;DR

CUPID is a benchmark designed to evaluate how well large language models can infer and adapt to users' dynamic preferences based on multi-turn interaction histories, highlighting current limitations in contextual personalization.

Contribution

This work introduces CUPID, a new benchmark with 756 interaction sessions to assess LLMs' ability to infer user preferences from context and interactions.

Findings

01. State-of-the-art LLMs have under 50% precision in inferring preferences.
02. LLMs struggle to identify relevant past context for new requests.
03. Current models achieve only 65% recall in preference inference.
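To make the precision and recall figures above concrete, here is a minimal sketch of how the two quantities differ for a single inferred preference. It assumes that both the inferred and the reference preference are broken into sets of atomic items and matched by exact string equality. That matching rule, the function name `precision_recall`, and the example items are illustrative assumptions, not the benchmark's actual scoring procedure.

```python
def precision_recall(inferred: set[str], gold: set[str]) -> tuple[float, float]:
    """Set-based precision and recall (hypothetical matching by exact equality)."""
    matched = inferred & gold
    # Precision: share of inferred items that are actually part of the preference.
    precision = len(matched) / len(inferred) if inferred else 0.0
    # Recall: share of the true preference that the model recovered.
    recall = len(matched) / len(gold) if gold else 0.0
    return precision, recall


# Made-up example items, for illustration only.
gold = {"use formal tone", "cite sources"}
inferred = {"use formal tone", "keep it short", "add emojis"}

p, r = precision_recall(inferred, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case the model recovers half of the true preference (recall 0.5), but two of its three guesses are spurious (precision about 0.33), which is the kind of gap between the two numbers that the findings describe.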

Abstract

Personalization of Large Language Models (LLMs) often assumes users hold static preferences that reflect globally in all tasks. In reality, humans hold dynamic preferences that change depending on the context. As users interact with an LLM in various contexts, they naturally reveal their contextual preferences, which a model must infer and apply in future contexts to ensure alignment. To assess this, we introduce CUPID, a benchmark of 756 human-curated interaction session histories between users and LLM-based chat assistants. In each interaction session, the user provides a request in a specific context and expresses their preference through multi-turn feedback. Given a new user request and prior interaction sessions, our benchmark assesses whether LLMs can infer the preference relevant to this request and generate a response that satisfies this preference. With CUPID, we evaluated 10…
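The task setup described in the abstract can be pictured as a small data structure: a list of prior sessions, each with a context and multi-turn feedback, plus a new request. The sketch below is only an illustration under that reading. The class names, field names, prompt layout, and example content are all hypothetical and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    role: str      # "user" or "assistant"
    content: str


@dataclass
class Session:
    context: str                                     # situation behind the request
    turns: list[Turn] = field(default_factory=list)  # request plus feedback turns


@dataclass
class Instance:
    prior_sessions: list[Session]  # interaction history available to the model
    new_request: str               # request the model must now answer


def build_prompt(instance: Instance) -> str:
    """Flatten prior sessions and the new request into one prompt string."""
    parts = []
    for i, session in enumerate(instance.prior_sessions, start=1):
        parts.append(f"[Session {i}] Context: {session.context}")
        parts.extend(f"{t.role}: {t.content}" for t in session.turns)
    parts.append(f"[New request] {instance.new_request}")
    return "\n".join(parts)


# Made-up example content, for illustration only.
example = Instance(
    prior_sessions=[
        Session(
            context="Writing a report for a senior manager",
            turns=[
                Turn("user", "Summarize these meeting notes."),
                Turn("assistant", "Here is a casual summary..."),
                Turn("user", "Too informal. Please use a formal tone."),
            ],
        )
    ],
    new_request="Draft an email to the same manager about the deadline.",
)

prompt = build_prompt(example)
print(prompt)
```

A model evaluated in this style would receive something like `prompt` and would need to notice that the earlier feedback about tone applies to the new request, even though the request itself does not restate it.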

Code & Models

Models

kixlab/prefmatcher-7b (Hugging Face model; 3 downloads, 1 like)
