GISTBench: Evaluating LLM User Understanding via Evidence-Based Interest Verification

Iordanis Fostiropoulos; Muhammad Rafay Azhar; Abdalaziz Sawwan; Boyu Fang; Yuchen Liu; Jiayi Liu; Hanchao Yu; Qi Guo; Jianyu Wang; Fei Liu; Xiangjun Fan

arXiv:2603.29112 · cs.AI · April 1, 2026

TL;DR

GISTBench is a new benchmark designed to evaluate LLMs' ability to understand user interests from interaction histories, focusing on interest verification rather than item prediction accuracy.

Contribution

GISTBench introduces two metric families, Interest Groundedness and Interest Specificity, along with a synthetic dataset built on real user interactions, to evaluate how well LLMs understand users.

Findings

01. Current LLMs show limited ability to accurately count and attribute engagement signals.

02. The benchmark reveals performance bottlenecks in LLMs' understanding of heterogeneous interaction data.

03. The dataset's fidelity is validated against user surveys.

Abstract

We introduce GISTBench, a benchmark for evaluating Large Language Models' (LLMs) ability to understand users from their interaction histories in recommendation systems. Unlike traditional RecSys benchmarks that focus on item prediction accuracy, our benchmark evaluates how well LLMs can extract and verify user interests from engagement data. We propose two novel metric families: Interest Groundedness (IG), decomposed into precision and recall components to separately penalize hallucinated interest categories and reward coverage, and Interest Specificity (IS), which assesses the distinctiveness of verified LLM-predicted user profiles. We release a synthetic dataset constructed on real user interactions on a global short-form video platform. Our dataset contains both implicit and explicit engagement signals and rich textual descriptions. We validate our dataset fidelity against user…
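The precision/recall split of Interest Groundedness can be illustrated with a small set-based sketch. This is not the paper's definition, which is not given in this summary; it only assumes that each predicted interest category is either verified against engagement evidence or not, and that a reference set of the user's interests is available. The names `interest_groundedness`, `predicted`, `verified`, and `reference` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroundednessScore:
    precision: float  # share of predicted interests that are backed by evidence
    recall: float     # share of reference interests that were recovered


def interest_groundedness(predicted, verified, reference):
    """Illustrative precision/recall over interest categories.

    Assumptions (not taken from the paper):
      - predicted: interest categories the LLM proposed for a user
      - verified:  the subset of predictions supported by engagement evidence
      - reference: interest categories treated as ground truth for that user
    """
    predicted = set(predicted)
    verified = set(verified) & predicted  # only predictions can be verified
    reference = set(reference)

    # Precision drops when the model proposes categories with no evidence.
    precision = len(verified) / len(predicted) if predicted else 0.0
    # Recall rises as verified predictions cover more reference interests.
    recall = len(verified & reference) / len(reference) if reference else 0.0
    return GroundednessScore(precision=precision, recall=recall)


# Toy example: one unsupported prediction ("chess"), two missed interests.
score = interest_groundedness(
    predicted={"cooking", "travel", "chess"},
    verified={"cooking", "travel"},
    reference={"cooking", "travel", "gardening", "music"},
)
```

In this toy case the unsupported "chess" prediction lowers precision, while the missed "gardening" and "music" interests lower recall, which is the separation of concerns the abstract describes.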

Code & Models

Datasets

facebook/gistbench
dataset · 181 downloads