LogicSkills: A Structured Benchmark for Formal Reasoning in Large Language Models
Brian Rabern, Philipp Mondorf, Barbara Plank

TL;DR
This paper introduces LogicSkills, a benchmark designed to evaluate fundamental logical skills in large language models, revealing strengths in validity assessment but weaknesses in symbolization and countermodel construction.
Contribution
The paper presents a benchmark that isolates three core logical skills (formal symbolization, countermodel construction, and validity assessment) and uses it to evaluate LLMs, highlighting gaps in their logical reasoning capabilities.
Findings
High performance in validity assessment by LLMs
Lower performance in formal symbolization and countermodel construction
Reasoning-tuned models perform better across all skills
Abstract
Large language models perform well on many logical reasoning benchmarks, but it remains unclear which core logical skills they truly master. To address this, we introduce LogicSkills, a benchmark that isolates three fundamental logical skills: (i) translating premises into first-order logic; (ii) showing that an argument is logically invalid by constructing a finite countermodel; and (iii) determining whether a conclusion follows from a set of premises. Items are drawn from the two-variable fragment of first-order logic without identity and are presented in both English and a Carrollian nonce-word language. All instances are solver-verified with Z3 for correctness and non-triviality. Across conventional instruction-tuned LLMs, performance is…
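To make the countermodel skill concrete, here is a minimal sketch of what "showing that an argument is invalid by constructing a finite countermodel" can look like. It is an illustration under stated assumptions, not the benchmark's pipeline: the paper verifies items with Z3, whereas this sketch swaps in plain brute-force enumeration over small domains so that it needs no dependencies. The argument itself (premises "every F bears R to something" and "something is F", conclusion "everything bears R to something") and the names `premises_hold`, `conclusion_holds`, and `find_countermodel` are hypothetical and not taken from the paper. The formulas use two variables and no identity, matching the fragment described in the abstract.

```python
from itertools import product

# Hypothetical example argument (not from the benchmark):
#   Premise 1:  forall x (F(x) -> exists y R(x, y))
#   Premise 2:  exists x F(x)
#   Conclusion: forall x exists y R(x, y)

def premises_hold(domain, F, R):
    """Return True if both premises are true in the interpretation (domain, F, R)."""
    p1 = all((x not in F) or any((x, y) in R for y in domain) for x in domain)
    p2 = any(x in F for x in domain)
    return p1 and p2

def conclusion_holds(domain, F, R):
    """Return True if the conclusion is true in the interpretation (domain, F, R)."""
    return all(any((x, y) in R for y in domain) for x in domain)

def find_countermodel(max_size=3):
    """Search domains of size 1..max_size for premises-true, conclusion-false."""
    for n in range(1, max_size + 1):
        domain = list(range(n))
        pairs = [(x, y) for x in domain for y in domain]
        # Enumerate every extension of the unary predicate F ...
        for f_bits in product([False, True], repeat=n):
            F = {x for x in domain if f_bits[x]}
            # ... and every extension of the binary relation R.
            for r_bits in product([False, True], repeat=len(pairs)):
                R = {pair for pair, bit in zip(pairs, r_bits) if bit}
                if premises_hold(domain, F, R) and not conclusion_holds(domain, F, R):
                    return {"domain": domain, "F": F, "R": R}
    return None  # no countermodel within the size bound

if __name__ == "__main__":
    print(find_countermodel())
```

A one-element domain cannot serve as a countermodel here, because the single element must be F and therefore must bear R to itself, which makes the conclusion true. The search therefore succeeds first at size two, with one element that is not F and has no R-successor. Enumeration grows exponentially in the number of predicate extensions, which is one reason a solver is the more practical choice for verification at scale.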
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Constraint Satisfaction and Optimization
