OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization
Yiyou Sun, Shawn Hu, Georgia Zhou, Ken Zheng, Hannaneh Hajishirzi, Nouha Dziri, Dawn Song

TL;DR
This paper introduces OMEGA, a diverse benchmark to evaluate large language models' ability to generalize in mathematics through exploratory, compositional, and transformative reasoning, revealing current limitations and guiding future improvements.
Contribution
The paper presents OMEGA, a novel benchmark for assessing out-of-distribution mathematical reasoning in LLMs across three creativity-inspired axes, with detailed analysis of model performance and limitations.
Findings
LLMs' performance drops with increasing problem complexity.
Fine-tuning improves exploratory reasoning but not compositional or transformative reasoning.
Transformative reasoning remains a significant challenge for current LLMs.
Abstract
Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3)…
