User Behavior Prediction as a Generic, Robust, Scalable, and Low-Cost Evaluation Strategy for Estimating Generalization in LLMs

Sougata Saha; Monojit Choudhury

arXiv:2507.05266·cs.CL·July 9, 2025

TL;DR

This paper proposes user behavior prediction as a scalable, robust, and low-cost method to evaluate the generalization of large language models, addressing limitations of traditional task-based assessments.

Contribution

The paper introduces a framework for user behavior prediction as an alternative evaluation strategy and tests it on movie and music recommendation datasets with three LLMs: GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct.

Findings

01. GPT-4o outperforms GPT-4o-mini and Llama-3.1-8B-Instruct in user behavior prediction tasks.

02. All models show significant room for improvement, especially Llama-3.1-8B-Instruct.

03. Empirical results align with the framework's predictions, supporting its potential as an evaluation method.

Abstract

Measuring the generalization ability of Large Language Models (LLMs) is challenging due to data contamination. As models grow and computation becomes cheaper, ensuring tasks and test cases are unseen during training phases will become nearly impossible. We argue that knowledge-retrieval and reasoning tasks are not ideal for measuring generalization, as LLMs are not trained for specific tasks. Instead, we propose user behavior prediction, also a key aspect of personalization, as a theoretically sound, scalable, and robust alternative. We introduce a novel framework for this approach and test it on movie and music recommendation datasets for GPT-4o, GPT-4o-mini, and Llama-3.1-8B-Instruct. Results align with our framework's predictions, showing GPT-4o outperforms GPT-4o-mini and Llama, though all models have much room for improvement, especially Llama.

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education