FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback

Youquan Li; Miao Zheng; Fan Yang; Guosheng Dong; Bin Cui; Weipeng Chen; Zenan Zhou; Wentao Zhang

arXiv:2410.09412·cs.CL·February 18, 2025

TL;DR

FB-Bench is a comprehensive multi-task benchmark designed to evaluate how well large language models respond to nuanced human feedback in realistic, multi-turn Chinese dialogue scenarios, highlighting current strengths and gaps.

Contribution

This paper introduces FB-Bench, a fine-grained multi-task benchmark for assessing how well LLMs respond to human feedback in complex, real-world interactions conducted in Chinese.

Findings

01. Significant performance variation across models and scenarios.

02. Task type and feedback quality greatly influence responsiveness.

03. Current models show both strengths and notable limitations.

Abstract

Human feedback is crucial in the interactions between humans and Large Language Models (LLMs). However, existing research primarily focuses on benchmarking LLMs in single-turn dialogues. Even in benchmarks designed for multi-turn dialogues, the user inputs are often independent, neglecting the nuanced and complex nature of human feedback within real-world usage scenarios. To fill this research gap, we introduce FB-Bench, a fine-grained, multi-task benchmark designed to evaluate LLMs' responsiveness to human feedback under real-world usage scenarios in Chinese. Drawing from the two main interaction scenarios, FB-Bench comprises 591 meticulously curated samples, encompassing eight task types, five deficiency types of response, and nine feedback types. We extensively evaluate a broad array of popular LLMs, revealing significant variations in their performance across different interaction…

Code & Models

Repositories

pku-baichuan-mlsystemlab/fb-bench (Official)

Videos

FB-Bench: A Fine-Grained Multi-Task Benchmark for Evaluating LLMs' Responsiveness to Human Feedback · Underline

Taxonomy

Topics: Educational Technology and Assessment