MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

Ashwani Anand; Ivi Chatzi; Ritam Raha; Anne-Kathrin Schmuck

arXiv:2605.06334·cs.CL·May 8, 2026

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents

Ashwani Anand, Ivi Chatzi, Ritam Raha, Anne-Kathrin Schmuck

PDF

TL;DR

MANTRA is a framework that automatically creates formal, machine-checkable benchmarks from natural-language manuals to evaluate tool-using LLM agents' compliance, scaling to complex, long-horizon tasks.

Contribution

It introduces a novel automated method for synthesizing compliance benchmarks using SMT validation, supporting arbitrary domains and long manuals with minimal human effort.

Findings

01

Built a benchmark suite with 285 tasks across 6 domains.

02

Compliance checks are richer and enforce stronger constraints than existing benchmarks.

03

Granular checks help debug agent failure modes.

Abstract

Tool-using large language model (LLM) agents are increasingly deployed in settings where their reliable behavior is governed by strict procedural manuals. Ensuring that such agents comply with the rules from these manuals is challenging, as they are typically written for humans in natural language while agent behavior manifests as an execution trace of tool calls. Existing evaluations of LLM agents rely on manually constructed benchmarks or LLM-based judges, which either do not scale or lack reliability for complex, long-horizon manuals. To overcome these limitations, we present MANTRA, a framework for automatically synthesizing machine-checkable compliance benchmarks from natural-language manuals and tool schemas. MANTRA independently generates (i) a symbolic world model capturing procedural dependencies, and (ii) a set of trace-level compliance checks for a given task, and validates…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.