Divide and Translate: Compositional First-Order Logic Translation and Verification for Complex Logical Reasoning
Hyun Ryu, Gyeongman Kim, Hyemin S. Lee, Eunho Yang

TL;DR
CLOVER introduces a compositional translation method for first-order logic in natural language reasoning, improving accuracy by verifying logical structures and outperforming previous neurosymbolic approaches on multiple benchmarks.
Contribution
The paper presents a novel compositional translation technique and verification algorithms that enhance the accuracy of translating natural language into first-order logic for reasoning tasks.
Findings
Outperforms previous neurosymbolic methods on seven benchmarks
Achieves state-of-the-art results in logical reasoning accuracy
Verification algorithms improve translation reliability
Abstract
Complex logical reasoning tasks require a long sequence of reasoning, which a large language model (LLM) with chain-of-thought prompting still falls short. To alleviate this issue, neurosymbolic approaches incorporate a symbolic solver. Specifically, an LLM only translates a natural language problem into a satisfiability (SAT) problem that consists of first-order logic formulas, and a sound symbolic solver returns a mathematically correct solution. However, we discover that LLMs have difficulties to capture complex logical semantics hidden in the natural language during translation. To resolve this limitation, we propose a Compositional First-Order Logic Translation. An LLM first parses a natural language sentence into newly defined logical dependency structures that consist of an atomic subsentence and its dependents, then sequentially translate the parsed subsentences. Since multiple…
Peer Reviews
Decision·ICLR 2025 Poster
- The paper presents a new approach by breaking down compositional translation into smaller steps and combining it with verification for logical reasoning tasks, leading to improved performance in neurosymbolic translation. - The experimental results are robust using GPT4-o, showing improvements over other methods across multiple benchmarks. - The authors propose two SAT-based first-order logic verification algorithms for selecting a sample from LLMs' logical code generations.
- The approach is primarily applicable to problems that can be represented in a SAT solver, limiting its generalizability to other reasoning datasets, such as those involving mathematical equations or visual components, e.g., MATH dataset. - The core idea of breaking down tasks into subtasks and using multiple samples and tests (e.g., verification, self-reflection, deterministic tests) to select the best generation is not novel. - The paper lacks comparison with chain-of-thought (CoT) based meth
It is reasonable to improve the translation quality by decomposing a complex sentence into several shorter sentences. Using SAT solvers certainly improve the quality.
Not all natural language sentences can be translated to first-order logic forms. Authors did not discuss what sentences cannot be translated. Authors use symbolic SAT solver in evaluating and selecting correct first-order logical forms. This limits the method only for the case where SAT solvers work. Theoretically, the meaning of natural language is not logical formula. This work is valued within fixed benchmark datasets. The formalism of the paper is not easy to read.
- The results are promising; the authors do a fantastic job of motivating their new algorithm CLOVER by showing common failures of previous methods like Logic-LM, then show that their method fixes many of these errors (leading to the performance boost reported in Table 1). - The method here is pretty novel. Breaking down sentences into atoms isn't too novel, but I haven't seen someone decode them all individually and progressively (tapping into the auto-regressive natural of LMs) to improve the
- The ablations in Table 3 need to be explained more clearly and then discussed. What is "Is Clover?" and why is the simplest ablation (no clover, direct translation, no verification / essentially none of the new things introduced in this paper) outperforming Logic-LM on AR-LSAT by 10.9% already from Table 1? Does this mean that your direct translation prompt already improves over complex algorithms like Logic-LM? If so, this deflates the papers impact, so it should be addressed (it's possible
Code & Models
Videos
Taxonomy
TopicsLogic, Reasoning, and Knowledge · Semantic Web and Ontologies · Advanced Algebra and Logic
