TL;DR
MIRAGE-Bench is a comprehensive benchmark designed to systematically evaluate hallucinations in LLM-based agents, addressing a critical gap by providing a unified, reproducible, and risk-aware assessment framework for their failure modes.
Contribution
The paper introduces MIRAGE-Bench, the first unified, systematic benchmark for measuring hallucinations in interactive LLM agents, including a taxonomy, test case synthesis, and a fine-grained evaluation paradigm.
Findings
Identifies three types of agentic hallucinations: unfaithfulness to instructions, history, or environment.
Provides a reproducible test case synthesis method for isolating decision points.
Develops a risk-aware LLM-as-a-Judge paradigm for scalable evaluation.
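The snapshot-and-judge workflow described in these findings can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Snapshot` field names, `build_prompt`, `evaluate`, and the per-type rate it reports are all assumptions made for the example, and the agent and judge are hard-coded stand-ins rather than LLM calls.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, Iterable, Tuple


class HallucinationType(Enum):
    """The three categories of the taxonomy: what the action is unfaithful to."""
    INSTRUCTION = "task instructions"
    HISTORY = "execution history"
    ENVIRONMENT = "environment observations"


@dataclass(frozen=True)
class Snapshot:
    """Context frozen at one decision point (field names are hypothetical)."""
    instruction: str
    history: Tuple[str, ...]
    observation: str
    risk: HallucinationType


def build_prompt(snap: Snapshot) -> str:
    """Serialize the frozen context so every model sees the same input."""
    steps = "\n".join(f"- {step}" for step in snap.history)
    return (
        f"Task: {snap.instruction}\n"
        f"History:\n{steps}\n"
        f"Observation: {snap.observation}\n"
        "Next action:"
    )


Agent = Callable[[str], str]
Judge = Callable[[Snapshot, str], bool]  # True = action judged unfaithful


def evaluate(
    snapshots: Iterable[Snapshot], agent: Agent, judge: Judge
) -> Dict[HallucinationType, float]:
    """Return the fraction of flagged actions per hallucination type."""
    totals: Dict[HallucinationType, int] = {}
    flagged: Dict[HallucinationType, int] = {}
    for snap in snapshots:
        action = agent(build_prompt(snap))
        totals[snap.risk] = totals.get(snap.risk, 0) + 1
        flagged[snap.risk] = flagged.get(snap.risk, 0) + int(judge(snap, action))
    return {risk: flagged[risk] / totals[risk] for risk in totals}


# Toy stand-ins (assumptions): a fixed agent and a rule-based judge.
def toy_agent(prompt: str) -> str:
    return "click('Checkout')"


def toy_judge(snap: Snapshot, action: str) -> bool:
    # Flags a click on a button that the observation does not list.
    target = action.split("'")[1]
    return target not in snap.observation


snap = Snapshot(
    instruction="Buy the item in the cart.",
    history=("open('/cart')",),
    observation="Buttons: 'Continue shopping', 'Remove item'",
    risk=HallucinationType.ENVIRONMENT,
)
rates = evaluate([snap], toy_agent, toy_judge)
print(rates)
```

Because each snapshot fixes the instruction, history, and observation, the same prompt can be replayed against any model without re-running the environment. In a real setup, `toy_judge` would be replaced by a call to a judge model.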
Abstract
Hallucinations pose critical risks for large language model (LLM)-based agents, often manifesting as hallucinative actions resulting from fabricated or misinterpreted information within the cognitive context. While recent studies have exposed such failures, existing evaluations remain fragmented and lack a principled testbed. In this paper, we present MIRAGE-Bench--Measuring Illusions in Risky AGEnt settings--the first unified benchmark for eliciting and evaluating hallucinations in interactive LLM-agent scenarios. We begin by introducing a three-part taxonomy to address agentic hallucinations: actions that are unfaithful to (i) task instructions, (ii) execution history, or (iii) environment observations. To analyze, we first elicit such failures by performing a systematic audit of existing agent benchmarks, then synthesize test cases using a snapshot strategy that isolates decision…
Peer Reviews
Decision·Submitted to ICLR 2026
1. This paper proposes a unified taxonomy of agentic hallucinations that categorizes failures based on unfaithfulness to task instructions, interaction history, and environmental observations. 2. The contextual snapshot strategy addresses non-determinism and setup complexity of full environments, which enables stable and reproducible evaluations without requiring full environment rollouts. 3. The benchmark covers a diverse range of interactive environments, spanning web, OS, software-engineering
1. Relies solely on Claude-3.5-Sonnet as the judge model, which may introduce bias or limit generalizability. The evaluation would be more robust with cross-validation using multiple judge models (e.g., GPT, Gemini) or an ablation on judge sensitivity. 2. More advanced state-of-the-art LLMs such as GPT-5, Gemini 2.5 Pro, and Claude-4-Sonnet/Opus are not evaluated. Would models with stronger reasoning abilities alleviate agentic hallucinations? The paper lacks an ablation study on models with varying
(1) This paper formally defines hallucinative actions and distinguishes three types of unfaithful behaviors (task instructions, interaction history, and environment observations), thereby extending the notion of hallucination from natural language generation to action-level decision-making in interactive agents. (2) The paper clearly defines key concepts such as hallucinative actions, the snapshot strategy, and the risk-aware LLM-as-a-Judge framework, and presents a logical and easy-to-follow fl
(1) The benchmark focuses on six well-chosen but mainly text- and web-centric settings. Including non-web domains (e.g., embodied or multimodal agents) would improve generality and demonstrate broader applicability. (2) Although the snapshot strategy ensures reproducibility, the paper does not show whether snapshots preserve full contextual fidelity. Experiments comparing snapshot vs. full-trajectory evaluation or perturbation tests would better support this assumption.
+ Fills a critical gap: First systematic benchmark for interactive agent hallucinations, going beyond single-turn QA (TruthfulQA, HALoGEN) and success-only agent evals (WebArena, AgentBench). Table 1 clearly shows the missing dimensions.
+ Strong taxonomy: Grounded in the ReAct loop; each category maps to real-world risks (e.g., credential leak via fake navigation, Fig 2).
+ Snapshot innovation: Freezing full context (instruction + history + observation) at hallucination-prone steps eliminates stochasticity
- Human evaluation critically under-specified: Only 160 samples used for judge validation. No annotator expertise reported (e.g., agent safety researchers?). No inter-annotator agreement (Cohen’s κ, Krippendorff’s α). A substantially larger validation set, documented domain expertise of raters, and published inter-rater reliability scores would significantly strengthen the credibility of the AI-safety assessment.
- "Multi-turn" claim vs. snap
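For reference, the Cohen’s κ statistic that this review asks for can be computed as in the sketch below. The function name and the example labels are made up for illustration and are not data from the paper. κ compares the observed agreement between two raters with the agreement expected by chance from each rater's label frequencies.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(labels_a)
    # Observed agreement: share of items where the two raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts (1 = hallucinated, 0 = faithful) from two annotators.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 0, 0, 0]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)
```

In this toy case the raters agree on 3 of 4 items (0.75) against a chance level of 0.5, giving κ = 0.5.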
