A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models
Jiayin Wang, Fengran Mo, Weizhi Ma, Peijie Sun, Min Zhang, Jian-Yun Nie

TL;DR
This paper introduces a user-centric benchmark for evaluating large language models based on real-world user scenarios and intents, providing a more practical assessment of their effectiveness in satisfying diverse user needs.
Contribution
It presents the URS dataset of 1,846 real-world use cases, categorizes user intents, and demonstrates that benchmark scores correlate strongly with human preferences.
Findings
Benchmark scores align closely with human preferences (Pearson correlations of 0.95 and 0.94).
The URS dataset effectively captures authentic user needs.
The evaluation method offers a practical way to assess LLMs from a user perspective.
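The agreement reported above is a Pearson correlation between benchmark scores and human preference ratings. As a minimal sketch of how such a coefficient is computed, assuming one score pair per model, the snippet below implements the standard formula in plain Python. The function name `pearson` and the two lists of numbers are hypothetical illustrations and are not data or code from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical numbers, one entry per model; NOT results from the paper.
benchmark_scores = [7.9, 7.1, 6.4, 5.8, 5.0]
human_ratings = [4.5, 4.1, 3.9, 3.2, 3.0]

r = pearson(benchmark_scores, human_ratings)
print(f"Pearson r = {r:.2f}")
```

A value near 1 means that models ranked higher by the benchmark also tend to be rated higher by people. The coefficient measures linear association only, so it says nothing about whether the two scales agree in absolute terms.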
Abstract
Large language models (LLMs) are essential tools that users employ across various scenarios, so evaluating their performance and guiding users in selecting a suitable service is important. Although many benchmarks exist, they mainly focus on specific predefined model abilities, such as world knowledge and reasoning. Based on these ability scores, it is hard for users to determine which LLM best suits their particular needs. To address these issues, we propose to evaluate LLMs from a user-centric perspective and design this benchmark to measure their efficacy in satisfying user needs under distinct intents. First, we collect 1,846 real-world use cases from a user study with 712 participants from 23 countries. This first-hand data helps us understand actual user intents and needs in LLM interactions, forming the User Reported Scenarios (URS) dataset, which is categorized with six…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: travel james · ALIGN · Focus
