Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction
Changyue Jiang, Xudong Pan, Min Yang

TL;DR
This paper introduces Thought-Aligner, a lightweight module that dynamically corrects risky thoughts in LLM-based agents, significantly improving safety without affecting the core agent framework.
Contribution
We propose Thought-Aligner, a resource-efficient thought correction module that enhances behavioral safety in LLM agents by correcting high-risk thoughts during execution.
Findings
Agent behavioral safety improved from 50% to 90% across benchmarks.
Thought-Aligner operates with less than 100 ms latency.
The model was trained on 5,000 instruction and thought pairs.
Abstract
LLM-based autonomous agents can reason, invoke tools, and interact with their environment, which lets them execute complex multi-step tasks. The internal reasoning process (the thought) at each step of a behavioral trajectory strongly influences tool usage and subsequent actions, but it can also introduce risk. Even minor deviations in the agent's thought may trigger cascading effects that lead to irreversible safety incidents. To address the safety alignment challenges in long-horizon behavioral trajectories, we propose Thought-Aligner, a plug-in dynamic thought correction module. Using a lightweight, resource-efficient model, Thought-Aligner corrects each high-risk thought on the fly before the corresponding action is executed. The corrected thought is then fed back to the agent, ensuring safer subsequent decisions and tool interactions. Importantly, Thought-Aligner…
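The correction loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Trajectory`, `run_agent`, `propose_thought`, `correct_thought`, and `choose_action` are hypothetical, and the keyword rule in `toy_correct` is only a stand-in for the trained correction model. The sketch shows one thing: each thought passes through a corrector before the action is chosen, and the corrected thought (not the original) is what the agent acts on and keeps in its history.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type aliases; none of these names come from the paper.
Thought = str
Action = str


@dataclass
class Trajectory:
    """Instruction plus the (thought, action) steps taken so far."""
    instruction: str
    steps: List[Tuple[Thought, Action]] = field(default_factory=list)


def run_agent(
    instruction: str,
    propose_thought: Callable[[Trajectory], Thought],
    correct_thought: Callable[[str, Trajectory, Thought], Thought],
    choose_action: Callable[[Trajectory, Thought], Action],
    max_steps: int = 5,
) -> Trajectory:
    """Agent loop with a plug-in corrector between thought and action."""
    traj = Trajectory(instruction)
    for _ in range(max_steps):
        thought = propose_thought(traj)
        # Correct the thought before any action is executed.
        safe_thought = correct_thought(instruction, traj, thought)
        # The agent acts on the corrected thought, not the original one.
        action = choose_action(traj, safe_thought)
        traj.steps.append((safe_thought, action))
        if action == "finish":
            break
    return traj


def toy_propose(traj: Trajectory) -> Thought:
    # Toy agent: the first thought is risky, the second ends the task.
    if not traj.steps:
        return "Delete all files in the directory without asking."
    return "The task is complete."


def toy_correct(instruction: str, traj: Trajectory, thought: Thought) -> Thought:
    # Stand-in for the trained corrector: a keyword rule, NOT the paper's model.
    if "without asking" in thought:
        return "Ask the user to confirm before deleting any files."
    return thought


def toy_act(traj: Trajectory, thought: Thought) -> Action:
    # Toy policy: finish when the thought says so, otherwise ask the user.
    return "finish" if "complete" in thought else "ask_user"


if __name__ == "__main__":
    result = run_agent("Clean up my workspace", toy_propose, toy_correct, toy_act)
    for step_thought, step_action in result.steps:
        print(f"{step_action}: {step_thought}")
```

Because the corrector is passed in as a plain function, the agent loop itself does not change when the corrector is swapped or removed, which mirrors the plug-in design the abstract describes.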
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Autonomous Vehicle Technology and Safety · Safety Systems Engineering in Autonomy
Methods: Contrastive Learning
