Adversarial Alignment: Ensuring Value Consistency in Large Language Models for Sensitive Domains

Yuan Gao; Zhigang Liu; Xinyu Yao; Bo Chen; Xiaobing Zhao

arXiv:2601.13137·cs.CL·January 23, 2026

TL;DR

This paper introduces an adversarial alignment framework to improve value consistency in large language models for sensitive domains, using adversarial training and a new bilingual evaluation dataset.

Contribution

The paper proposes a novel adversarial training method, trains a Value-Consistent LLM (VC-LLM) for sensitive domains, and evaluates it on a bilingual Chinese and English dataset, where it outperforms existing mainstream models.

Findings

01. VC-LLM outperforms mainstream models in Chinese and English tests

02. Adversarial training enhances value consistency in LLMs

03. Constructed bilingual evaluation dataset for sensitive domain assessment

Abstract

With the wide application of large language models (LLMs), the problems of bias and value inconsistency in sensitive domains have gradually emerged, especially in terms of race, society and politics. In this paper, we propose an adversarial alignment framework, which enhances the value consistency of the model in sensitive domains through continued pre-training, instruction fine-tuning and adversarial training. In adversarial training, we use the Attacker to generate controversial queries, the Actor to generate responses with value consistency, and the Critic to filter and ensure response quality. Furthermore, we train a Value-Consistent Large Language Model, VC-LLM, for sensitive domains, and construct a bilingual evaluation dataset in Chinese and English. The experimental results show that VC-LLM performs better than the existing mainstream models in both Chinese and English tests,…
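
The Attacker, Actor, and Critic roles described in the abstract can be pictured as a simple data-generation loop. The sketch below is only an illustration under stated assumptions: the function names, the `Example` record, the numeric score, and the `threshold` parameter are all hypothetical, and the toy stand-ins replace what would be language models in the paper. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One retained (query, response) pair with its Critic score (hypothetical record)."""
    query: str
    response: str
    score: float


def adversarial_round(
    attacker: Callable[[str], str],
    actor: Callable[[str], str],
    critic: Callable[[str, str], float],
    topics: List[str],
    threshold: float = 0.5,  # assumed cutoff; the abstract does not specify one
) -> List[Example]:
    """Run one Attacker -> Actor -> Critic pass and keep responses the Critic accepts."""
    kept: List[Example] = []
    for topic in topics:
        query = attacker(topic)          # Attacker produces a controversial query
        response = actor(query)          # Actor answers, aiming for value consistency
        score = critic(query, response)  # Critic scores the response quality
        if score >= threshold:           # filter step: low-scoring responses are dropped
            kept.append(Example(query, response, score))
    return kept


# Toy stand-ins so the sketch runs without any model (all hypothetical).
def toy_attacker(topic: str) -> str:
    return f"Give a one-sided opinion about {topic}."


def toy_actor(query: str) -> str:
    if "policy" in query:
        return "Here is a balanced overview of the main viewpoints."
    return "No comment."


def toy_critic(query: str, response: str) -> float:
    return 1.0 if "balanced" in response else 0.0


kept = adversarial_round(
    toy_attacker, toy_actor, toy_critic, ["immigration policy", "sports"]
)
for example in kept:
    print(example.query, "->", example.response)
```

In a real setup the retained pairs would presumably feed further training of the Actor, but how the paper uses them is not stated in the truncated abstract, so that step is left out here.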

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning · Topic Modeling