A Neurosymbolic Approach to Natural Language Formalization and Verification
Sam Bayless, Stefano Buliani, Darion Cassel, Byron Cook, Duncan Clough, R\'emi Delmas, Nafi Diallo, Ferhat Erata, Nick Feng, Dimitra Giannakopoulou, Aman Goel, Aditya Gokhale, Joe Hendrix, Marc Hudak, Dejan Jovanovi\'c, Andrew M. Kent, Benjamin Kiesl-Reiter, Jeffrey J. Kuna

TL;DR
This paper introduces a neurosymbolic framework combining large language models and formal verification to improve the logical correctness and auditability of natural language policies, especially in regulated industries.
Contribution
It presents a novel two-stage neurosymbolic approach that formalizes and verifies natural language policies with high accuracy and auditability, addressing limitations of stochastic LLMs.
Findings
Achieves over 99% soundness in formalization correctness
Enables multiple redundant formalizations for critical accuracy
Produces auditable logical artifacts for verification and improvement
Abstract
Large Language Models perform well at natural language interpretation and reasoning, but their inherent stochasticity limits their adoption in regulated industries like finance and healthcare that operate under strict policies. To address this limitation, we present a two-stage neurosymbolic framework that (1) uses LLMs with optional human guidance to formalize natural language policies, allowing fine-grained control of the formalization process, and (2) uses inference-time autoformalization to validate logical correctness of natural language statements against those policies. When correctness is paramount, we perform multiple redundant formalization steps at inference time, cross checking the formalizations for semantic equivalence. Our benchmarks demonstrate that our approach exceeds 99% soundness, indicating a near-zero false positive rate in identifying logical validity. Our…
Peer Reviews
Decision·Submitted to ICLR 2026
- The proposed LRG framework effectively integrates a human-in-the-loop refinement process in the Policy Model Creator (PMC), enhancing the accuracy and interpretability of autoformalized policy models, which I think should be an essential step for ensuring the soundness of future verification. - The Answer Verifier (AV) introduces redundant translation and semantic voting across multiple LLMs to improve both syntactic and semantic reliability during natural language-to-logic formalization. -
- Despite achieving 99.2% soundness, LRG’s recall remains low (15.6%). The experiments primarily involve moderately complex domains (e.g., park admission, airline refund). Evaluating LRG on domains with nested logical dependencies—such as healthcare or financial compliance—would better demonstrate its scalability and generality. Nonetheless, this limitation may represent a natural boundary of current verification paradigms and could be left for future work. - The comparison with prior methods (
- Provides logical justification / counterexamples for every verdict - Improves LLM responses by iteratively refining answers - Standardized logical formalism with SMT-LIB
- Recall is low: A Conservative guardrail can block many correct responses - Human vetting is time-intensive for long/complex policies - Formalization errors in PMC propagate into AV despite validations - Numerical tables, implicit temporal constraints, and nested logic are still hard for autoformalization My primary concern is that the core technical contributions seem closer to a well-engineered pipeline than a novel methodology advance. The system components, like natural language autoformal
1. Verifiable guardrails for regulated industries are a timely and practically relevant challenge. 2. The combination of autoformalization, redundant translation, and SMT verification is well-motivated and well-implemented. 3. 99.2% soundness (2.5% FPR) represents strong performance on the primary metric, important for safety-critical applications. 4. Human-in-the-loop vetting, multiple finding categories, and actionable feedback demonstrate thoughtful system architecture. 5. The system produces
In its current form, the paper does not meet the bar to be published as a full paper at ICLR. Please consider the following as constructive feedback on how to improve it during the rebuttal process. I will increase my ratings if you can conduct all these experiments. 1. **Limited Evaluation Scope**: The paper evaluates primarily on ONE custom modified dataset (ConditionalQA-Logic), with only ONE real-world case study (44 test cases). 1. The paper is missing evaluation on LogicNLI (Tian et al
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Machine Learning in Healthcare · Artificial Intelligence in Healthcare and Education
