Agentified Assessment of Logical Reasoning Agents

Zhiyu Ni; Yifeng Xiao; Zheng Liang

arXiv:2603.02788 · cs.AI · April 3, 2026

TL;DR

This paper introduces a reproducible and auditable framework for evaluating logical reasoning agents, demonstrated through benchmarking an auto-formalization agent on FOL reasoning tasks.

Contribution

It proposes an agentified assessment framework with an assessor agent for standardized, robust evaluation of reasoning agents, including a case study with a formalization agent.

Findings

01

The auto-formalization agent achieves 86.70% accuracy on the cleaned FOLIO validation set under the assessor protocol.

02

The framework enables structured failure recording and enforces execution budgets.

03

The auto-formalization agent outperforms a chain-of-thought baseline, which reaches 73.89% accuracy.
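
To make the second finding concrete, here is a minimal sketch of an assessor loop that issues tasks, enforces a per-task execution budget, parses the answer, and records a structured failure type. It is an illustration under stated assumptions, not the paper's implementation: the names `FailureType`, `Record`, `assess`, and `accuracy`, the three failure categories, and the label set are all hypothetical, and the agent under test is modeled as a plain Python callable rather than an agent-to-agent interface.

```python
import concurrent.futures as cf
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple

# Assumed label set for illustration; the benchmark's actual labels may differ.
LABELS = {"True", "False", "Uncertain"}


class FailureType(Enum):
    """Hypothetical failure categories an assessor might record."""
    TIMEOUT = "timeout"          # agent exceeded its execution budget
    AGENT_ERROR = "agent_error"  # agent raised an exception
    PARSE_ERROR = "parse_error"  # output was not a recognized label


@dataclass
class Record:
    task_id: str
    prediction: Optional[str]
    failure: Optional[FailureType]
    correct: bool
    elapsed_s: float


def assess(agent: Callable[[str], str],
           tasks: List[Tuple[str, str, str]],
           budget_s: float) -> List[Record]:
    """Run `agent` on (task_id, prompt, gold_label) triples under a time budget."""
    records = []
    for task_id, prompt, gold in tasks:
        pool = cf.ThreadPoolExecutor(max_workers=1)
        start = time.monotonic()
        future = pool.submit(agent, prompt)
        prediction, failure = None, None
        try:
            answer = str(future.result(timeout=budget_s)).strip()
            if answer in LABELS:
                prediction = answer
            else:
                failure = FailureType.PARSE_ERROR
        except cf.TimeoutError:
            failure = FailureType.TIMEOUT
        except Exception:
            failure = FailureType.AGENT_ERROR
        finally:
            # The worker thread is not killed here; a stricter assessor
            # would run the agent in a separate process it can terminate.
            pool.shutdown(wait=False)
        elapsed = time.monotonic() - start
        records.append(Record(task_id, prediction, failure,
                              prediction == gold, elapsed))
    return records


def accuracy(records: List[Record]) -> float:
    """Fraction of correct records; any failure counts as incorrect."""
    return sum(r.correct for r in records) / len(records) if records else 0.0
```

Counting every failure as an incorrect answer keeps the reported accuracy comparable across agents that fail in different ways, while the `failure` field preserves why each task was lost.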

Abstract

We present a framework for evaluating and benchmarking logical reasoning agents when assessment itself must be reproducible, auditable, and robust to execution failures. Building on agentified assessment, we use an assessor agent to issue tasks, enforce execution budgets, parse outputs, and record structured failure types, while the agent under test only needs to expose a standardized agent-to-agent interface. As a case study, we benchmark an auto-formalization agent for first-order logic (FOL) reasoning on a solver-verified and repaired split of FOLIO. The agent translates natural language premises and conclusions into executable Z3Py programs and employs satisfiability modulo theories (SMT) solving to determine logical entailment. On the cleaned FOLIO validation set, the auto-formalization agent achieves 86.70% accuracy under the assessor protocol, outperforming a chain-of-thought…
