MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck

TL;DR
MANTRA is a framework that automatically creates formal, machine-checkable benchmarks from natural-language manuals to evaluate tool-using LLM agents' compliance, scaling to complex, long-horizon tasks.
Contribution
It introduces a novel automated method for synthesizing compliance benchmarks using SMT validation, supporting arbitrary domains and long manuals with minimal human effort.
Findings
Built a benchmark suite with 285 tasks across 6 domains.
Compliance checks are richer and enforce stronger constraints than existing benchmarks.
Granular checks help debug agent failure modes.
Abstract
Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
