Reasoning as an Adaptive Defense for Safety

Taeyoun Kim; Fahim Tajwar; Aditi Raghunathan; Aviral Kumar

arXiv:2507.00971·cs.LG·October 28, 2025

TL;DR

This paper introduces TARS, a reinforcement learning method that trains large language models to reason about safety, improving their robustness against harmful prompts and jailbreak attacks by adaptively allocating compute during inference.

Contribution

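The paper presents TARS (Training Adaptive Reasoners for Safety), an RL-based training recipe that improves LLM safety and robustness through adaptive reasoning. The recipe rests on three design choices: a lightweight warmstart SFT stage, a training mix of harmful, harmless, and ambiguous prompts, and a reward function that prevents reasoning capabilities from degenerating during training.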

Findings

1. Models trained with TARS better distinguish safe and unsafe prompts.

2. TARS-trained models show increased robustness to white-box and black-box attacks.

3. Adaptive reasoning leads to improved safety-refusal trade-offs.

Abstract

Reasoning methods that adaptively allocate test-time compute have advanced LLM performance on easy-to-verify domains such as math and code. In this work, we study how to use this approach to train models that exhibit a degree of robustness to safety vulnerabilities, and show that doing so can provide benefits. We build a recipe called TARS (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. To build TARS, we identify three critical design choices: (1) a "lightweight" warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as too many refusals, and (3) a reward function to prevent degeneration of reasoning capabilities during training. Models trained with…
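The abstract's point about mixing prompt types to prevent over-refusal can be illustrated with a toy sketch. This is not the paper's reward function: the `Judgement` scores, the `mixed_reward` function, the 0.5 weighting on ambiguous prompts, and the two example policies are all hypothetical, chosen only to show why a reward that scores both safety and task completion over a mixed batch makes "always refuse" a poor strategy.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Judgement:
    """Hypothetical scores for one response, each in [0, 1]."""
    safety: float      # 1.0 = response contains nothing harmful
    completion: float  # 1.0 = response fully completes the task


def mixed_reward(prompt_type: str, j: Judgement) -> float:
    """Toy reward (an assumption, not the TARS reward).

    Harmful prompts are scored on safety, harmless prompts on task
    completion, and ambiguous prompts on an equal blend of the two.
    """
    if prompt_type == "harmful":
        return j.safety
    if prompt_type == "harmless":
        return j.completion
    if prompt_type == "ambiguous":
        return 0.5 * (j.safety + j.completion)
    raise ValueError(f"unknown prompt type: {prompt_type!r}")


def mean_reward(policy: Callable[[str], Judgement],
                prompt_types: Sequence[str]) -> float:
    """Average reward of a policy over a batch of prompt types."""
    rewards = [mixed_reward(t, policy(t)) for t in prompt_types]
    return sum(rewards) / len(rewards)


def always_refuse(prompt_type: str) -> Judgement:
    # Shortcut behavior: refusing is always safe but never completes the task.
    return Judgement(safety=1.0, completion=0.0)


def adaptive(prompt_type: str) -> Judgement:
    # Idealized behavior: refuse harmful prompts, answer the rest safely.
    if prompt_type == "harmful":
        return Judgement(safety=1.0, completion=0.0)
    return Judgement(safety=1.0, completion=1.0)


BATCH = ["harmful", "harmless", "ambiguous"]

if __name__ == "__main__":
    print("always_refuse:", mean_reward(always_refuse, BATCH))
    print("adaptive:     ", mean_reward(adaptive, BATCH))
```

Under these assumed scores, the refusal shortcut earns full reward only on the harmful prompt, so its batch average falls below that of the adaptive policy. If the batch contained only harmful prompts, the two policies would tie, which is one way to read the abstract's motivation for including harmless and ambiguous prompts.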

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Software Testing and Debugging Techniques · Security and Verification in Computing