AlignBench: Benchmarking Chinese Alignment of Large Language Models

Xiao Liu; Xuanyu Lei; Shengyuan Wang; Yue Huang; Zhuoer Feng; Bosi Wen; Jiale Cheng; Pei Ke; Yifan Xu; Weng Lam Tam; Xiaohan Zhang; Lichao Sun; Xiaotao Gu; Hongning Wang; Jing Zhang; Minlie Huang; Yuxiao Dong; Jie Tang

arXiv:2311.18743 · cs.CL · August 27, 2024 · 1 citation

TL;DR

AlignBench is a comprehensive benchmark designed to evaluate the alignment of Chinese large language models across multiple dimensions, utilizing human-verified data and an LLM-based evaluation approach for high reliability.

Contribution

This paper introduces AlignBench, the first multi-dimensional Chinese LLM alignment benchmark with a human-in-the-loop data curation pipeline and an LLM-as-Judge evaluation method.

Findings

01. AlignBench covers 683 real-scenario queries with verified references.

02. It employs a rule-calibrated LLM-as-Judge for reliable evaluation.

03. AlignBench has been adopted by top Chinese LLMs for alignment assessment.

Abstract

Alignment has become a critical step for instruction-tuned Large Language Models (LLMs) to become helpful assistants. However, the effective evaluation of alignment for emerging Chinese LLMs is still largely unexplored. To fill this gap, we introduce AlignBench, a comprehensive multi-dimensional benchmark for evaluating LLMs' alignment in Chinese. We design a human-in-the-loop data curation pipeline, containing eight main categories, 683 real-scenario rooted queries and corresponding human-verified references. To ensure the correctness of references, each knowledge-intensive query is accompanied by evidence collected from reliable web sources (including URLs and quotations) by our annotators. For automatic evaluation, our benchmark employs a rule-calibrated multi-dimensional LLM-as-Judge (Zheng et al., 2023) approach with Chain-of-Thought to generate explanations and final…
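As a rough illustration of the setup the abstract describes, the sketch below models one benchmark entry (category, query, human-verified reference, and web evidence with URL and quotation) and a judge prompt that asks for an explanation before per-dimension ratings. Everything here is an assumption made for illustration: the class and function names, the prompt wording, the dimension names, and the averaging in `category_means` are hypothetical and are not taken from the AlignBench repository or paper.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Evidence:
    url: str        # source page the annotator consulted
    quotation: str  # passage supporting the reference answer


@dataclass
class BenchmarkItem:
    category: str   # one of the benchmark's main categories
    query: str
    reference: str  # human-verified reference answer
    evidence: list[Evidence] = field(default_factory=list)


def build_judge_prompt(item: BenchmarkItem, answer: str, dimensions: list[str]) -> str:
    """Assemble a judge prompt that asks for an explanation before the scores."""
    dims = ", ".join(dimensions)
    return (
        f"Question: {item.query}\n"
        f"Reference answer: {item.reference}\n"
        f"Model answer: {answer}\n"
        f"First explain your reasoning step by step, then rate the model "
        f"answer on each dimension ({dims})."
    )


def category_means(
    scored: list[tuple[BenchmarkItem, dict[str, float]]],
) -> dict[str, float]:
    """Average each item's per-dimension scores, then average per category.

    Plain averaging is an illustrative choice, not the paper's scoring rule.
    """
    by_category: dict[str, list[float]] = {}
    for item, scores in scored:
        by_category.setdefault(item.category, []).append(mean(scores.values()))
    return {cat: mean(vals) for cat, vals in by_category.items()}
```

A caller would send the prompt to whichever judge model it uses, parse the returned ratings into a dictionary keyed by dimension, and pass the (item, scores) pairs to `category_means`.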

Code & Models

Repositories

thudm/alignbench (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Artificial Intelligence in Healthcare and Education