Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Zhiyu Ni; Zheng Liang; Liangcheng Song; Chenrui Cao; Xian Zhang; Alberto Sangiovanni-Vincentelli; Pierluigi Nuzzo

arXiv:2603.17233·cs.AI·March 30, 2026

Draft-and-Prune: Improving the Reliability of Auto-formalization for Logical Reasoning

Zhiyu Ni, Zheng Liang, Liangcheng Song, Chenrui Cao, Xian Zhang, Alberto Sangiovanni-Vincentelli, Pierluigi Nuzzo

PDF

TL;DR

Draft-and-Prune enhances auto-formalization for logical reasoning by generating multiple plans, pruning contradictory formalizations, and aggregating results, leading to significant accuracy improvements across benchmarks.

Contribution

It introduces a novel inference-time framework that improves the reliability of auto-formalization through diversity, verification, and majority voting, without requiring extra supervision.

Findings

01

Achieves 78.43% accuracy on AR-LSAT with GPT-4.

02

Attains near-ceiling performance on PrOntoQA and LogicalDeduction.

03

Substantially outperforms previous AF baselines like MAD-LOGIC and CLOVER.

Abstract

Auto-formalization (AF) translates natural-language reasoning problems into solver-executable programs, enabling symbolic solvers to perform sound logical deduction. In practice, however, AF pipelines are currently brittle: programs may fail to execute, or execute but encode incorrect semantics. While prior work largely mitigates syntactic failures via repairs based on solver feedback, reducing semantics failures remains a major bottleneck. We propose Draft-and-Prune (D&P), an inference-time framework that improves AF-based logical reasoning via diversity and verification. D&P first drafts multiple natural-language plans and conditions program generation on them. It further prunes executable but contradictory or ambiguous formalizations, and aggregates predictions from surviving paths via majority voting. Across four representative benchmarks (AR-LSAT, ProofWriter, PrOntoQA,…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.