FOL-Traces: Verified First-Order Logic Reasoning Traces at Scale
Isabelle Lee, Sarah Liaw, Dani Yogatama

TL;DR
FOL-Traces is a large-scale, verified dataset for evaluating structured logical inference in language models, addressing previous limitations of unverifiable traces and small datasets.
Contribution
The paper introduces FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, along with two diagnostic tasks (masked operation prediction and step completion) that probe syntactic awareness and process fidelity.
Findings
The 5 reasoning LLMs evaluated reach only around 45.7% accuracy on masked operation prediction.
The same models reach only around 27% accuracy on two-step completion.
FOL-Traces remains a challenging benchmark for reasoning models.
Abstract
Reasoning in language models is difficult to evaluate: natural-language traces are unverifiable, symbolic datasets are too small, and most benchmarks conflate heuristics with inference. We present FOL-Traces, the first large-scale dataset of programmatically verified reasoning traces, enabling rigorous evaluation of structured logical inference. We also propose two challenging and comprehensive diagnostic tasks, masked operation prediction and step completion, that directly probe syntactic awareness and process fidelity. FOL-Traces serves as a scalable testbed for rigorously studying how models perform structured logical inference. Systematic experiments with 5 reasoning LLMs show that the dataset remains challenging: models only reach around 45.7% accuracy on masked operation prediction and around 27% on two-step completion.
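To make the masked operation prediction task more concrete, here is a minimal Python sketch of how one item might be built and scored. The abstract does not specify the dataset's format, so everything in it is an assumption made for illustration: the operator symbols, the `[MASK]` token, and the helper names `mask_operator` and `accuracy` are all hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

# Hypothetical operator vocabulary; the symbol set used by FOL-Traces is not given here.
OPERATORS = ["∧", "∨", "→", "¬"]
# Hypothetical mask token, chosen only for this illustration.
MASK = "[MASK]"


def mask_operator(formula: str, index: int) -> Tuple[str, str]:
    """Hide the index-th operator in a formula string.

    Returns the masked formula and the operator that was hidden (the gold label).
    """
    # Collect the character position and symbol of every operator occurrence.
    positions = [(i, ch) for i, ch in enumerate(formula) if ch in OPERATORS]
    pos, gold = positions[index]
    masked = formula[:pos] + MASK + formula[pos + 1:]
    return masked, gold


def accuracy(predictions: List[str], golds: List[str]) -> float:
    """Fraction of predicted operators that exactly match the gold operators."""
    correct = sum(p == g for p, g in zip(predictions, golds))
    return correct / len(golds)


# Illustrative use: hide the second operator of a toy formula.
masked, gold = mask_operator("P(a) ∧ Q(a) → R(a)", 1)
print(masked)  # P(a) ∧ Q(a) [MASK] R(a)
print(gold)    # →
```

Exact-match scoring over a small, closed operator set is the simplest choice for a sketch like this; the paper's own evaluation protocol may differ.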
Taxonomy
Topics: Logic, Reasoning, and Knowledge · Advanced Algebra and Logic · Logic, programming, and type systems
