Flames: Benchmarking Value Alignment of LLMs in Chinese
Kexin Huang, Xiangyang Liu, Qianyu Guo, Tianxiang Sun, Jiawei Sun, Yaru Wang, Zeyang Zhou, Yixu Wang, Yan Teng, Xipeng Qiu, Yingchun Wang, Dahua Lin

TL;DR
The paper introduces Flames, a benchmark for evaluating how well LLMs align with human values in Chinese. It exposes safety gaps in current models and proposes a lightweight scorer for multi-dimensional evaluation.
Contribution
It presents Flames, a novel benchmark incorporating Chinese cultural values and complex adversarial prompts, to better assess LLMs' safety and moral alignment.
Findings
All evaluated LLMs perform poorly on Flames, especially in safety and fairness.
Existing benchmarks are insufficient for uncovering safety vulnerabilities.
A lightweight scoring method enables efficient multi-dimensional evaluation.
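To make the idea of multi-dimensional evaluation concrete, here is a minimal sketch of how per-response scores could be rolled up into one number per dimension. It is not the paper's scorer: the function name `aggregate_by_dimension`, the score scale, the sample values, and the use of a plain mean are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_dimension(records):
    """Average per-response scores within each dimension.

    `records` is an iterable of (dimension, score) pairs. The pair layout
    and the "higher is more harmless" scale are assumptions for this sketch.
    """
    grouped = defaultdict(list)
    for dimension, score in records:
        grouped[dimension].append(score)
    return {dimension: mean(scores) for dimension, scores in grouped.items()}


# Hypothetical scorer outputs, invented for illustration only.
records = [
    ("safety", 1), ("safety", 2), ("safety", 1),
    ("fairness", 1), ("fairness", 3),
    ("morality", 3), ("morality", 2),
]

summary = aggregate_by_dimension(records)

# The dimension with the lowest average is where the model is weakest.
weakest = min(summary, key=summary.get)
print(summary, weakest)
```

Reporting one value per dimension, rather than a single overall score, is what lets a finding such as "weakest in safety and fairness" be read directly off the results.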
Abstract
The widespread adoption of large language models (LLMs) across various regions underscores the urgent need to evaluate their alignment with human values. Current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in LLMs. Despite numerous models achieving high scores and 'topping the chart' in these evaluations, there is still a significant gap in LLMs' deeper alignment with human values and achieving genuine harmlessness. To this end, this paper proposes a value alignment benchmark named Flames, which encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values such as harmony. Accordingly, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. By prompting 17 mainstream LLMs, we obtain model responses and rigorously…
Taxonomy
Topics: Topic Modeling · Adversarial Robustness in Machine Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Dropout · Softmax · Adam · Label Smoothing · Absolute Position Encodings
