SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark

Liang Xu; Anqi Li; Lei Zhu; Hang Xue; Changtai Zhu; Kangkang Zhao; Haonan He; Xuanwei Zhang; Qiyue Kang; Zhenzhong Lan

arXiv:2307.15020 · cs.CL · July 28, 2023 · 23 cites

TL;DR

SuperCLUE is a comprehensive Chinese benchmark for large language models that evaluates performance across diverse real-world tasks and user preferences, highlighting the limitations of accuracy metrics alone.

Contribution

This paper introduces SuperCLUE, the first Chinese benchmark that combines real user queries, open-ended questions, and closed-ended questions to better assess LLMs in practical scenarios.

Findings

01 Accuracy on closed-ended questions alone is insufficient to reflect human preferences.

02 Open-ended and closed-ended questions together better predict user preferences.

03 GPT-4 reliably evaluates human preferences in Chinese LLM tasks.

Abstract

Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences…
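One common way to check whether a benchmark score reflects human preferences is to compare the model ranking it produces with the ranking implied by user ratings. The sketch below computes a Spearman rank correlation in plain Python. It is an illustration only: the helper names (`ranks`, `pearson`, `spearman`) and all numbers are hypothetical placeholders, not results or methods taken from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical placeholder scores for four unnamed models (NOT paper data).
closed_accuracy = [0.71, 0.64, 0.58, 0.66]   # closed-ended accuracy
user_win_rate = [0.62, 0.40, 0.45, 0.55]     # share of user-rated battles won

print(f"Spearman correlation: {spearman(closed_accuracy, user_win_rate):.2f}")
```

A value near 1 would mean the two rankings largely agree, while a low value would suggest that closed-ended accuracy alone is a weak proxy for what users prefer.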

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare

Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Linear Layer · Softmax · Layer Normalization · Dense Connections · Dropout · Focus · Position-Wise Feed-Forward Layer