Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

Changyue Jiang; Xudong Pan; Min Yang

arXiv:2505.11063·cs.AI·May 20, 2025

TL;DR

This paper introduces Thought-Aligner, a lightweight module that dynamically corrects risky thoughts in LLM-based agents, significantly improving safety without affecting the core agent framework.

Contribution

We propose Thought-Aligner, a resource-efficient thought correction module that enhances behavioral safety in LLM agents by correcting high-risk thoughts during execution.

Findings

01. Safety improved from 50% to 90% across benchmarks.

02. Thought-Aligner operates with less than 100ms latency.

03. Model trained on 5,000 instruction and thought pairs.

Abstract

LLM-based autonomous agents possess capabilities such as reasoning, tool invocation, and environment interaction, enabling the execution of complex multi-step tasks. The internal reasoning process of a behavioral trajectory, i.e., the thought, significantly influences tool usage and subsequent actions but can also introduce risks. Even minor deviations in the agent's thought may trigger cascading effects that lead to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Using a lightweight and resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before the corresponding action is executed. The corrected thought is then fed back to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner…
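The correction loop described in the abstract (propose a thought, correct it if risky, then act on the corrected thought and feed it back) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `run_agent`, `keyword_corrector`, `scripted_agent`, `echo_act`, and `RISKY_MARKERS` are hypothetical names, and the keyword rule is only a stand-in for the fine-tuned Thought-Aligner model.

```python
from typing import Callable, List, Optional

# Hypothetical interfaces, assumed for illustration only.
ProposeFn = Callable[[str, List[str]], Optional[str]]  # (instruction, history) -> thought
CorrectFn = Callable[[str, List[str], str], str]       # (instruction, history, thought) -> thought
ActFn = Callable[[str], str]                           # thought -> observation

# Toy risk markers; the paper uses a trained model, not keyword matching.
RISKY_MARKERS = ("rm -rf", "without confirmation", "disable the firewall")


def keyword_corrector(instruction: str, history: List[str], thought: str) -> str:
    """Stand-in corrector: rewrite a thought that contains a risky marker."""
    if any(marker in thought.lower() for marker in RISKY_MARKERS):
        return "I should ask the user to confirm before taking a destructive step."
    return thought


def run_agent(instruction: str, propose: ProposeFn, act: ActFn,
              correct: CorrectFn, max_steps: int = 5) -> List[str]:
    """Run the loop: each thought is corrected before its action executes."""
    history: List[str] = []
    for _ in range(max_steps):
        thought = propose(instruction, history)
        if thought is None:  # the agent has nothing more to do
            break
        safe_thought = correct(instruction, history, thought)
        observation = act(safe_thought)
        # The corrected thought, not the original, is what the agent sees next.
        history.extend([safe_thought, observation])
    return history


def scripted_agent(thoughts: List[str]) -> ProposeFn:
    """Replay a fixed list of thoughts in place of a real LLM agent."""
    remaining = iter(thoughts)
    return lambda instruction, history: next(remaining, None)


def echo_act(thought: str) -> str:
    """Pretend to execute a tool call and report back."""
    return f"acted on: {thought}"


if __name__ == "__main__":
    trace = run_agent(
        "Clean up temporary files.",
        scripted_agent(["List the files in /tmp.", "Run rm -rf / to free space."]),
        echo_act,
        keyword_corrector,
    )
    for entry in trace:
        print(entry)
```

Because the corrector sits between the agent's proposal and the action, the agent and its tools stay unchanged, which is the plug-in property the abstract emphasizes.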

Code & Models

Repositories

thu-coai/agent-safetybench
none · Official

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety · Safety Systems Engineering in Autonomy

Methods: Contrastive Learning