The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Zhexin Zhang; Yida Lu; Junfeng Fang; Junxiao Yang; Shiyao Cui; Hao Zhou; Fandong Meng; Jie Zhou; Hongning Wang; Minlie Huang; Tat-Seng Chua

arXiv:2602.04196·cs.CL·February 5, 2026

TL;DR

This paper presents a systematic study of implicit safety risks that arise during AI training, showing that they are both prevalent and severe despite receiving far less attention than deployment-time risks.

Contribution

The paper introduces a taxonomy of training-time safety risks, reports extensive experiments demonstrating their prevalence, and analyzes the factors that influence these risks across training settings.

Findings

01

74.4% of training runs with Llama-3.1-8B-Instruct exhibit risky behaviors

02

Identifies five risk levels and ten fine-grained categories of training-time safety risk

03

Implicit risks also occur in multi-agent training environments

Abstract

Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs…

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Reinforcement Learning in Robotics · Ethics and Social Impacts of AI