SuperCLUE: A Comprehensive Chinese Large Language Model Benchmark
Liang Xu, Anqi Li, Lei Zhu, Hang Xue, Changtai Zhu, Kangkang Zhao, Haonan He, Xuanwei Zhang, Qiyue Kang, Zhenzhong Lan

TL;DR
SuperCLUE is a comprehensive Chinese benchmark for large language models that evaluates performance across diverse real-world tasks and user preferences, highlighting the limitations of accuracy metrics alone.
Contribution
This paper introduces SuperCLUE, the first Chinese benchmark to combine real user queries, open-ended questions, and closed-ended questions in order to better assess LLMs in practical scenarios.
Findings
Accuracy on closed-ended questions alone is insufficient to reflect human preferences.
Open-ended and closed-ended questions together better predict user preferences.
GPT-4 can serve as a reliable automatic judge of human preferences on open-ended questions in Chinese.
Abstract
Large language models (LLMs) have shown the potential to be integrated into human daily lives. Therefore, user preference is the most critical criterion for assessing LLMs' performance in real-world scenarios. However, existing benchmarks mainly focus on measuring models' accuracy using multi-choice questions, which limits the understanding of their capabilities in real applications. We fill this gap by proposing a comprehensive Chinese benchmark SuperCLUE, named after another popular Chinese LLM benchmark CLUE. SuperCLUE encompasses three sub-tasks: actual users' queries and ratings derived from an LLM battle platform (CArena), open-ended questions with single and multiple-turn dialogues (OPEN), and closed-ended questions with the same stems as open-ended single-turn ones (CLOSE). Our study shows that accuracy on closed-ended questions is insufficient to reflect human preferences…
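The comparison the abstract describes, between user preferences collected on a battle platform and accuracy on closed-ended questions, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the record layout `(model_a, model_b, winner)`, the half-credit treatment of ties, and the function names `win_rates`, `average_ranks`, and `spearman` are all assumptions made for illustration. The sketch turns pairwise battle outcomes into per-model win rates and then measures how well another score, such as closed-ended accuracy, agrees with them by Spearman rank correlation.

```python
from collections import defaultdict


def win_rates(battles):
    """Per-model win rate from pairwise battle records.

    battles: iterable of (model_a, model_b, winner), where winner is
    "a", "b", or "tie". This record layout is a hypothetical one.
    A tie gives each side half a win (an assumption, not the paper's rule).
    """
    wins = defaultdict(float)
    games = defaultdict(int)
    for model_a, model_b, winner in battles:
        games[model_a] += 1
        games[model_b] += 1
        if winner == "a":
            wins[model_a] += 1.0
        elif winner == "b":
            wins[model_b] += 1.0
        else:
            wins[model_a] += 0.5
            wins[model_b] += 0.5
    return {model: wins[model] / games[model] for model in games}


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

To use it, compute `win_rates` over the battle records, list the models in a fixed order, and pass the win rates together with each model's closed-ended accuracy to `spearman`. A value near 1 would mean the two rankings agree, and a low or negative value would mean that accuracy alone ranks the models differently from users.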
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Byte Pair Encoding · Linear Layer · Softmax · Layer Normalization · Dense Connections · Dropout · Focus · Position-Wise Feed-Forward Layer
