OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

Yiyou Sun; Shawn Hu; Georgia Zhou; Ken Zheng; Hannaneh Hajishirzi; Nouha Dziri; Dawn Song

arXiv:2506.18880 · cs.CL · June 24, 2025

TL;DR

This paper introduces OMEGA, a diverse benchmark to evaluate large language models' ability to generalize in mathematics through exploratory, compositional, and transformative reasoning, revealing current limitations and guiding future improvements.

Contribution

The paper presents OMEGA, a novel benchmark for assessing out-of-distribution mathematical reasoning in LLMs across three creativity-inspired axes, with detailed analysis of model performance and limitations.

Findings

01. LLMs' performance drops with increasing problem complexity.

02. Fine-tuning improves exploratory reasoning but not compositional or transformative reasoning.

03. Transformative reasoning remains a significant challenge for current LLMs.

Abstract

Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3)…
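The exploratory and compositional axes defined in the abstract can be read as two different train/test splits. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the `Problem` record, its `skills` and `complexity` fields, and both split functions are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Problem:
    # Hypothetical record; the benchmark's actual schema may differ.
    skills: frozenset  # reasoning skills the problem requires
    complexity: int    # assumed integer difficulty level


def exploratory_split(problems, skill, max_train_complexity):
    """Same single skill: train on simpler instances, test on harder ones."""
    pool = [p for p in problems if p.skills == frozenset({skill})]
    train = [p for p in pool if p.complexity <= max_train_complexity]
    test = [p for p in pool if p.complexity > max_train_complexity]
    return train, test


def compositional_split(problems, skill_a, skill_b):
    """Train on each skill in isolation; test on problems that need both."""
    a, b = frozenset({skill_a}), frozenset({skill_b})
    train = [p for p in problems if p.skills in (a, b)]
    test = [p for p in problems if p.skills == a | b]
    return train, test
```

In this reading, the exploratory split holds the skill fixed and varies only difficulty, while the compositional split holds out every problem that requires both skills together, so a model sees each skill only in isolation during training.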
