AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models
Qin Zhu, Fei Huang, Runyu Peng, Keming Lu, Bowen Yu, Qinyuan Cheng, Xipeng Qiu, Xuanjing Huang, Junyang Lin

TL;DR
AutoLogi introduces an automated, bilingual logic puzzle benchmark for more accurate evaluation of large language models' reasoning abilities, overcoming limitations of multiple-choice tests and enabling systematic training data generation.
Contribution
The paper presents a novel automated method for synthesizing open-ended logic puzzles with controllable difficulty, improving reasoning assessment and training data quality for LLMs.
Findings
Across eight modern LLMs, AutoLogi scores span 35% to 73%, a wider and more discriminating range than the 21% to 37% observed on the multiple-choice source.
Because open-ended answers cannot be obtained by random guessing among options, AutoLogi reflects true model capabilities better than traditional multiple-choice benchmarks.
The synthesis method supports generating high-quality training data for LLM improvement.
Abstract
While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice…
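The program-based verification mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the puzzle, the names `PEOPLE`, `verify`, and `count_solutions`, and both constraints are hypothetical, chosen only to show how a checker can accept any valid open-ended answer instead of matching one option letter.

```python
from itertools import permutations

# Hypothetical puzzle (an assumption for illustration, not taken from AutoLogi):
# seat A, B, and C in a row of three seats, left to right, such that
#   1. A sits somewhere to the left of B
#   2. C does not sit in the middle seat
PEOPLE = ["A", "B", "C"]


def verify(arrangement):
    """Return True if the proposed seating satisfies every constraint."""
    # Format check: the answer must use each person exactly once.
    if len(arrangement) != len(PEOPLE) or set(arrangement) != set(PEOPLE):
        return False
    pos = {name: i for i, name in enumerate(arrangement)}
    constraints = [
        pos["A"] < pos["B"],  # constraint 1
        pos["C"] != 1,        # constraint 2
    ]
    return all(constraints)


def count_solutions():
    """Count valid seatings by brute-force enumeration of all orderings."""
    return sum(verify(list(p)) for p in permutations(PEOPLE))


if __name__ == "__main__":
    print(verify(["A", "B", "C"]))  # satisfies both constraints
    print(verify(["A", "C", "B"]))  # violates constraint 2
    print(count_solutions())
```

In this toy puzzle more than one seating passes the checker, which is the point of verifying with a program: a model's answer is judged against the constraints themselves rather than against a fixed answer key.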
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning
