TL;DR
This paper introduces SoLT, a benchmark that tests LLMs' logical reasoning across diverse linguistic forms, and MenTaL, a method that links different expressions of the same concept to shared symbols so that translation into logic stays consistent and reasoning becomes more stable.
Contribution
The paper presents a new benchmark, SoLT, for evaluating LLMs on linguistically diverse logical reasoning, and proposes MenTaL, a method to improve symbol consistency during translation.
Findings
LLMs struggle with inconsistent symbol mapping under linguistic variation.
Applying MenTaL improves reasoning accuracy and stability across diverse inputs.
Linguistic diversity significantly impacts LLM-based logical reasoning performance.
Abstract
Logical reasoning with large language models (LLMs) has received growing attention. One mainstream approach translates natural language into formal logic and then applies symbolic solvers for deduction. While effective in many tasks, these LLM-based translators often fail to generate consistent symbolic representations when the same concept appears in different linguistic forms. Such inconsistencies break logical coherence and lead to solver errors. However, most existing benchmarks lack this type of linguistic variation, which frequently occurs in real-world text, leaving the problem underexplored. To address this gap, we present SoLT, a benchmark that systematically rewrites reasoning datasets into diverse yet logically equivalent forms across multiple levels. Beyond evaluation, SoLT also provides a general method to enrich any dataset with linguistic diversity while preserving both…
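The failure mode described in the abstract can be illustrated with a toy example. The sketch below is not the SoLT benchmark or the MenTaL method; it is a minimal propositional forward-chainer, and the names `translate`, `forward_chain`, `entails`, and the `shared` synonym table are all hypothetical, introduced only for illustration. It shows how a deduction that should succeed fails when "is happy" and "is joyful" are translated to different symbols, and succeeds once both surface forms are mapped to one shared symbol.

```python
# Toy illustration only: not the SoLT benchmark or the MenTaL implementation.

def forward_chain(facts, rules):
    """Derive every atom reachable from `facts` using Horn rules (body, head)."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for body, head in rules:
            if head not in known and all(atom in known for atom in body):
                known.add(head)
                changed = True
    return known


def translate(phrases, symbol_table=None):
    """Map each phrase to a symbol.

    Without a table, every surface form gets its own symbol, which mimics
    a translator that treats synonyms as unrelated predicates.
    """
    table = symbol_table or {}
    return [table.get(p, p.replace(" ", "_")) for p in phrases]


# Premises: "Alice is happy." and "Anyone who is joyful smiles."
fact_phrases = ["is happy"]
rule_phrases = (["is joyful"], "smiles")

# Hypothetical table linking both surface forms to one shared symbol.
shared = {"is happy": "Happy", "is joyful": "Happy", "smiles": "Smiles"}


def entails(goal_phrase, symbol_table=None):
    """Check whether the goal follows from the premises after translation."""
    facts = translate(fact_phrases, symbol_table)
    body = translate(rule_phrases[0], symbol_table)
    head = translate([rule_phrases[1]], symbol_table)[0]
    goal = translate([goal_phrase], symbol_table)[0]
    return goal in forward_chain(facts, [(body, head)])


inconsistent = entails("smiles")        # synonyms mapped to different symbols
consistent = entails("smiles", shared)  # synonyms mapped to one shared symbol
print(inconsistent, consistent)
```

In the first call the rule body `is_joyful` never matches the fact `is_happy`, so the solver cannot derive the conclusion even though the premises entail it in natural language. With the shared table, both phrases become `Happy` and the derivation goes through.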
