Flames: Benchmarking Value Alignment of LLMs in Chinese

Kexin Huang; Xiangyang Liu; Qianyu Guo; Tianxiang Sun; Jiawei Sun; Yaru Wang; Zeyang Zhou; Yixu Wang; Yan Teng; Xipeng Qiu; Yingchun Wang; Dahua Lin

arXiv:2311.06899 · cs.CL · May 31, 2024 · 2 cites

TL;DR

The paper introduces Flames, a Chinese-language benchmark for evaluating how well LLMs align with human values. It exposes safety gaps in current models and provides a new evaluation tool.

Contribution

It presents Flames, a novel benchmark incorporating Chinese cultural values and complex adversarial prompts, to better assess LLMs' safety and moral alignment.

Findings

01. All evaluated LLMs perform poorly on Flames, especially in safety and fairness.

02. Existing benchmarks are insufficient for uncovering safety vulnerabilities.

03. A lightweight scoring method enables efficient multi-dimensional evaluation.

Abstract

The widespread adoption of large language models (LLMs) across various regions underscores the urgent need to evaluate their alignment with human values. Current benchmarks, however, fall short of effectively uncovering safety vulnerabilities in LLMs. Despite numerous models achieving high scores and 'topping the chart' in these evaluations, there is still a significant gap in LLMs' deeper alignment with human values and achieving genuine harmlessness. To this end, this paper proposes a value alignment benchmark named Flames, which encompasses both common harmlessness principles and a unique morality dimension that integrates specific Chinese values such as harmony. Accordingly, we carefully design adversarial prompts that incorporate complex scenarios and jailbreaking methods, mostly with implicit malice. By prompting 17 mainstream LLMs, we obtain model responses and rigorously…

Code & Models

Repositories

aiflames/flames (official)

Models

🤗 CaasiHUANG/flames-scorer
model · 19 dl · ♡ 5

Datasets

PKU-Alignment/Flames-1k-Chinese
dataset · 26 dl

Taxonomy

Topics: Topic Modeling · Adversarial Robustness in Machine Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Byte Pair Encoding · Dropout · Softmax · Adam · Label Smoothing · Absolute Position Encodings