Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Lecheng Yan; Ruizhe Li; Xicheng Han; Wenxi Li; Binwu Wang; Longyue Wang; Chenyang Lyu; Guanhua Chen

arXiv:2605.17453·cs.CR·May 19, 2026

Trust No Tool: Evaluating and Defending LLM Agents under Untrusted Tool Feedback

Lecheng Yan, Ruizhe Li, Xicheng Han, Wenxi Li, Binwu Wang, Longyue Wang, Chenyang Lyu, Guanhua Chen

PDF

TL;DR

This paper introduces TRUST-Bench and VISTA-Guard to evaluate and improve the security of LLM agents against malicious tool feedback, emphasizing trust dynamics over static prompts.

Contribution

It presents a new benchmark, TRUST-Bench, and a risk scoring framework, VISTA-Guard, to detect and defend against cognitive poisoning in tool-using LLM agents.

Findings

01

VISTA-Guard achieves 84.2 in-domain risk score.

02

Trajectory-aware scoring outperforms prompt-centric heuristics.

03

Methods optimizing only safety or utility collapse to zero.

Abstract

Tool-using LLM agents increasingly rely on external tools to make consequential decisions, yet most existing agent-security benchmarks and defenses implicitly assume that tool feedback is trustworthy once a tool has been selected. We study a different failure mode, cognitive poisoning, in which a malicious tool behaves plausibly during exploration, accumulates trust through benign-looking feedback, and becomes harmful only when hidden state conditions align with the final executable action. To study this setting, we construct TRUST-Bench, a task-conditioned benchmark of 1,970 hidden-trigger tool-compromise episodes with matched safe controls, introduce an asymmetric penalty metric, GuardedJoint, to better reflect real deployment risk, and present VISTA-Guard, a backbone-agnostic framework for final-action risk scoring. The core idea is to abstract multi-step tool interaction into…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.