A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models

Jiayin Wang; Fengran Mo; Weizhi Ma; Peijie Sun; Min Zhang; Jian-Yun Nie

arXiv:2404.13940·cs.CL·September 23, 2024·2 cites

TL;DR

This paper introduces a user-centric benchmark for evaluating large language models based on real-world user scenarios and intents, providing a more practical assessment of their effectiveness in satisfying diverse user needs.

Contribution

It presents the URS dataset of 1,846 real-world use cases, categorizes user intents, and demonstrates that benchmark scores correlate strongly with human preferences.

Findings

1. Benchmark scores align with human preferences (Pearson correlation 0.95 and 0.94).
2. The URS dataset effectively captures authentic user needs.
3. The evaluation method offers a practical way to assess LLMs from a user perspective.
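
The Pearson correlations quoted in the first finding measure how closely benchmark scores track human preference scores across models. As a minimal sketch of that statistic only (not the authors' evaluation code), the snippet below computes Pearson's r from scratch. The function name `pearson` and both score lists are hypothetical, invented for illustration; they are not data from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-model scores, invented for illustration only.
benchmark_scores = [7.9, 7.1, 6.4, 5.8, 5.0]
human_ratings = [8.2, 7.0, 6.9, 5.5, 5.1]

r = pearson(benchmark_scores, human_ratings)
print(f"Pearson r = {r:.2f}")
```

A value of r close to 1 means that models ranked highly by the benchmark also tend to be rated highly by people. It does not by itself show that the two scales agree in absolute terms.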

Abstract

Large language models (LLMs) are essential tools that users employ across various scenarios, so evaluating their performance and guiding users in selecting the suitable service is important. Although many benchmarks exist, they mainly focus on specific predefined model abilities, such as world knowledge, reasoning, etc. Based on these ability scores, it is hard for users to determine which LLM best suits their particular needs. To address these issues, we propose to evaluate LLMs from a user-centric perspective and design this benchmark to measure their efficacy in satisfying user needs under distinct intents. Firstly, we collect 1,846 real-world use cases from a user study with 712 participants from 23 countries. This first-hand data helps us understand actual user intents and needs in LLM interactions, forming the User Reported Scenarios (URS) dataset, which is categorized with six…

Code & Models

Repositories

alice1998/urs (Official)

Videos

A User-Centric Multi-Intent Benchmark for Evaluating Large Language Models · underline

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: travel james · ALIGN · Focus