Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles

Jiangjie Chen; Qianyu He; Siyu Yuan; Aili Chen; Zhicheng Cai; Weinan Dai; Hongli Yu; Qiying Yu; Xuefeng Li; Jiaze Chen; Hao Zhou; Mingxuan Wang

arXiv:2505.19914 · cs.CL · June 10, 2025

TL;DR

Enigmata introduces a comprehensive suite of synthetic puzzles and evaluation benchmarks to significantly improve the logical reasoning capabilities of large language models through scalable multi-task reinforcement learning.

Contribution

The paper presents Enigmata, a multi-task puzzle suite built on a generator-verifier design, and shows that training on it improves LLM reasoning across multiple benchmarks and tasks.

Findings

1. Qwen2.5-32B-Enigmata surpasses o3-mini-high and o1 on puzzle benchmarks.

2. Enigmata-trained models generalize well to out-of-domain reasoning tasks.

3. Puzzle data from Enigmata boosts performance on advanced math and STEM reasoning.

Abstract

Large Language Models (LLMs), such as OpenAI's o1 and DeepSeek's R1, excel at advanced reasoning tasks like math and coding via Reinforcement Learning with Verifiable Rewards (RLVR), but still struggle with puzzles solvable by humans without domain knowledge. We introduce Enigmata, the first comprehensive suite tailored for improving LLMs with puzzle reasoning skills. It includes 36 tasks across seven categories, each with 1) a generator that produces unlimited examples with controllable difficulty and 2) a rule-based verifier for automatic evaluation. This generator-verifier design supports scalable, multi-task RL training, fine-grained analysis, and seamless RLVR integration. We further propose Enigmata-Eval, a rigorous benchmark, and develop optimized multi-task RLVR strategies. Our trained model, Qwen2.5-32B-Enigmata, consistently surpasses o3-mini-high and o1 on the puzzle…
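
The generator-verifier design described in the abstract can be illustrated with a minimal sketch. The toy puzzle below (find two positions in a list whose values sum to a target) is a hypothetical stand-in and is not one of the 36 Enigmata tasks. The names `Puzzle`, `generate`, `verify`, and `brute_force_solve` are assumptions made for illustration, as is the choice to let difficulty control the list length. The verifier returns a binary reward, which is one common way to plug a rule-based check into RLVR.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Puzzle:
    """Hypothetical toy puzzle: pick two positions whose values sum to target."""
    numbers: tuple[int, ...]
    target: int


def generate(difficulty: int, seed: int) -> Puzzle:
    """Generator: produce a solvable instance; higher difficulty means a longer list."""
    rng = random.Random(seed)            # seeded, so instances are reproducible
    n = 3 + 2 * difficulty               # assumed difficulty knob: list length
    numbers = tuple(rng.sample(range(1, 10 * n), n))  # n distinct values
    i, j = rng.sample(range(n), 2)       # two distinct positions guarantee a solution
    return Puzzle(numbers=numbers, target=numbers[i] + numbers[j])


def verify(puzzle: Puzzle, answer: tuple[int, int]) -> float:
    """Rule-based verifier: reward 1.0 if the answer is valid, else 0.0."""
    i, j = answer
    n = len(puzzle.numbers)
    if i == j or not (0 <= i < n and 0 <= j < n):
        return 0.0
    return 1.0 if puzzle.numbers[i] + puzzle.numbers[j] == puzzle.target else 0.0


def brute_force_solve(puzzle: Puzzle) -> tuple[int, int]:
    """Reference solver used only to sanity-check the generator and verifier."""
    n = len(puzzle.numbers)
    for i in range(n):
        for j in range(i + 1, n):
            if puzzle.numbers[i] + puzzle.numbers[j] == puzzle.target:
                return (i, j)
    raise ValueError("generator produced an unsolvable puzzle")


if __name__ == "__main__":
    puzzle = generate(difficulty=2, seed=0)
    print(puzzle)
    print("reward:", verify(puzzle, brute_force_solve(puzzle)))
```

Because the generator is seeded and the verifier is a pure function of the puzzle and the answer, any number of fresh instances can be sampled per difficulty level and scored automatically, with no human labels. That is the property the abstract relies on for scalable multi-task RL training.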

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies