Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback
Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li, Binwu Wang, Longyue Wang, Chenyang Lyu, Guanhua Chen

TL;DR
This paper introduces TRUST-Bench and VISTA-Guard to evaluate and improve the security of LLM agents against malicious tool feedback, emphasizing trust dynamics over static prompts.
Contribution
It presents a new benchmark, TRUST-Bench, and a risk scoring framework, VISTA-Guard, to detect and defend against cognitive poisoning in tool-using LLM agents.
Findings
VISTA-Guard achieves an in-domain risk score of 84.2.
Trajectory-aware scoring outperforms prompt-centric heuristics.
Methods that optimize only safety or only utility collapse to zero.
Abstract
Tool-using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent-security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign-looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST-Bench, a task-conditioned benchmark of 1,970 hidden-trigger tool-compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA-Guard, a backbone-agnostic framework for final-action risk scoring. The core idea is to abstract multi-step tool interaction into…
