AutoLogi: Automated Generation of Logic Puzzles for Evaluating Reasoning Abilities of Large Language Models

Qin Zhu; Fei Huang; Runyu Peng; Keming Lu; Bowen Yu; Qinyuan Cheng; Xipeng Qiu; Xuanjing Huang; Junyang Lin

arXiv:2502.16906·cs.CL·February 25, 2025

TL;DR

AutoLogi introduces an automated, bilingual logic puzzle benchmark for more accurate evaluation of large language models' reasoning abilities, overcoming limitations of multiple-choice tests and enabling systematic training data generation.

Contribution

The paper presents a novel automated method for synthesizing open-ended logic puzzles with controllable difficulty, improving reasoning assessment and training data quality for LLMs.

Findings

01. Across the eight LLMs evaluated, AutoLogi scores span 35% to 73%, a wider range than the 21% to 37% reported for the multiple-choice source.

02. The open-ended format reflects true model capabilities better than traditional multiple-choice benchmarks, which are vulnerable to random guessing.

03. The synthesis method can also generate high-quality training data for improving LLMs.

Abstract

While logical reasoning evaluation of Large Language Models (LLMs) has attracted significant attention, existing benchmarks predominantly rely on multiple-choice formats that are vulnerable to random guessing, leading to overestimated performance and substantial performance fluctuations. To obtain more accurate assessments of models' reasoning capabilities, we propose an automated method for synthesizing open-ended logic puzzles, and use it to develop a bilingual benchmark, AutoLogi. Our approach features program-based verification and controllable difficulty levels, enabling more reliable evaluation that better distinguishes models' reasoning abilities. Extensive evaluation of eight modern LLMs shows that AutoLogi can better reflect true model capabilities, with performance scores spanning from 35% to 73% compared to the narrower range of 21% to 37% on the source multiple-choice…
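The idea of program-based verification for an open-ended puzzle can be illustrated with a minimal sketch. This is not the authors' implementation: the puzzle, the names `PEOPLE`, `verify_format`, `verify_constraints`, `is_correct`, and `count_solutions`, and the two constraints are all hypothetical, chosen only to show how a program can check a free-form answer instead of comparing it to a fixed multiple-choice option.

```python
from itertools import permutations

# Hypothetical puzzle (an assumption for illustration, not from the paper):
# place three people in positions 1..3 subject to two constraints.
PEOPLE = ("Ann", "Bob", "Cid")


def verify_format(answer):
    """Return True if the answer maps each person to a distinct position 1..3."""
    return (
        isinstance(answer, dict)
        and set(answer) == set(PEOPLE)
        and sorted(answer.values()) == [1, 2, 3]
    )


def verify_constraints(answer):
    """Return True if the arrangement satisfies the puzzle's logical constraints."""
    return (
        answer["Ann"] < answer["Bob"]  # Ann stands somewhere before Bob
        and answer["Cid"] != 1         # Cid is not in the first position
    )


def is_correct(answer):
    """Accept any well-formed arrangement that satisfies every constraint."""
    return verify_format(answer) and verify_constraints(answer)


def count_solutions():
    """Enumerate all arrangements and count those the verifier accepts."""
    return sum(
        is_correct(dict(zip(PEOPLE, perm)))
        for perm in permutations(range(1, 4))
    )
```

Under these assumptions, `is_correct({"Ann": 1, "Bob": 2, "Cid": 3})` is accepted, while a malformed answer or one that puts Cid first is rejected. Because any satisfying arrangement passes, a model cannot score by picking one of a handful of listed options, which is the weakness of multiple-choice formats that the abstract describes.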

Code & Models

Repositories

8188zq/AutoLogi (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Intelligent Tutoring Systems and Adaptive Learning