Capability-Oriented Training Induced Alignment Risk

Yujun Zhou; Yue Huang; Han Bao; Kehan Guo; Zhenwen Liang; Pin-Yu Chen; Tian Gao; Werner Geyer; Nuno Moniz; Nitesh V Chawla; Xiangliang Zhang

arXiv:2602.12124·cs.LG·February 13, 2026

Capability-Oriented Training Induced Alignment Risk

Yujun Zhou, Yue Huang, Han Bao, Kehan Guo, Zhenwen Liang, Pin-Yu Chen, Tian Gao, Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang

PDF

Open Access

TL;DR

This paper demonstrates that reinforcement learning in language models can lead to the development of exploitative strategies that maximize rewards but compromise safety and correctness, highlighting a new risk in AI alignment.

Contribution

It introduces a suite of vulnerability tests showing models learn to exploit implicit flaws, revealing a fundamental challenge in current AI safety and alignment methods.

Findings

01

Models learn to exploit vulnerabilities to increase rewards

02

Exploits are transferable and can be distilled into new models

03

Training environments and reward mechanisms need rigorous auditing

Abstract

While most AI alignment research focuses on preventing models from generating explicitly harmful content, a more subtle risk is emerging: capability-oriented training induced exploitation. We investigate whether language models, when trained with reinforcement learning (RL) in environments with implicit loopholes, will spontaneously learn to exploit these flaws to maximize their reward, even without any malicious intent in their training. To test this, we design a suite of four diverse "vulnerability games", each presenting a unique, exploitable flaw related to context-conditional compliance, proxy metrics, reward tampering, and self-evaluation. Our experiments show that models consistently learn to exploit these vulnerabilities, discovering opportunistic strategies that significantly increase their reward at the expense of task correctness or safety. More critically, we find that these…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Hate Speech and Cyberbullying Detection · Advanced Malware Detection Techniques