Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use

Aradhye Agarwal; Gurdit Siyan; Yash Pandya; Joykirat Singh; Akshay Nambi; Ahmed Awadallah

arXiv:2603.03205·cs.CL·March 4, 2026

Open Access

TL;DR

This paper introduces MOSAIC, a post-training framework that aligns agentic language models for safe multi-step tool use by making safety decisions explicit and learnable. The authors report improved safety and robustness across multiple tasks and models.

Contribution

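MOSAIC is a post-training alignment method that structures inference as a plan, check, then act or refuse loop, treating safety reasoning and refusal as explicit, learnable decisions. It targets the limitations of existing alignment methods, which are largely optimized for static generation and task completion rather than agentic settings.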

Findings

01 Reduces harmful behavior by up to 50%

02 Increases harmful-task refusal by over 20% on injection attacks

03 Cuts privacy leakage while maintaining or improving benign task performance

Abstract

Agentic language models operate in a fundamentally different safety regime than chat models: they must plan, call tools, and execute long-horizon actions where a single misstep, such as accessing files or entering credentials, can cause irreversible harm. Existing alignment methods, largely optimized for static generation and task completion, break down in these settings due to sequential decision-making, adversarial tool feedback, and overconfident intermediate reasoning. We introduce MOSAIC, a post-training framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. MOSAIC structures inference as a plan, check, then act or refuse loop, with explicit safety reasoning and refusal as first-class actions. To train without trajectory-level labels, we use preference-based reinforcement learning with pairwise trajectory comparisons, which…
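To make the plan, check, then act or refuse loop concrete, here is a minimal Python sketch of the control flow only. It is not the authors' implementation: the names `Step`, `Trace`, `run_agent`, `toy_check`, and `toy_act` are hypothetical, and the blocklist check is a stand-in assumption for the learned safety reasoning the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Step:
    """One planned tool call (hypothetical structure for illustration)."""
    tool: str
    argument: str


@dataclass
class Trace:
    """Record of what the agent executed and whether it refused."""
    executed: List[Step] = field(default_factory=list)
    refused: bool = False
    reason: str = ""


def run_agent(
    plan: List[Step],
    check: Callable[[Step], Tuple[bool, str]],
    act: Callable[[Step], None],
) -> Trace:
    """Walk the plan; before each step, run a safety check, then act or refuse."""
    trace = Trace()
    for step in plan:
        safe, reason = check(step)
        if not safe:
            # Refusal is a first-class outcome: record it and stop acting.
            trace.refused = True
            trace.reason = reason
            break
        act(step)
        trace.executed.append(step)
    return trace


def toy_check(step: Step) -> Tuple[bool, str]:
    """Assumed stand-in for learned safety reasoning: a fixed tool blocklist."""
    blocked = {"enter_credentials", "delete_files"}
    if step.tool in blocked:
        return False, f"tool '{step.tool}' is on the toy blocklist"
    return True, "ok"


def toy_act(step: Step) -> None:
    """Placeholder tool executor; a real agent would call the tool here."""
    return None


plan = [
    Step("search_web", "flight prices"),
    Step("enter_credentials", "bank login"),
    Step("send_email", "summary"),
]
trace = run_agent(plan, toy_check, toy_act)
print(trace.refused, len(trace.executed), trace.reason)
```

In this toy run the agent executes the first step, refuses at the second, and never reaches the third. The point of the sketch is only that the check runs before every action rather than once at the start of the task.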

Taxonomy

Topics: Adversarial Robustness in Machine Learning · Topic Modeling · Explainable Artificial Intelligence (XAI)