I-RAVEN-X: Benchmarking Generalization and Robustness of Analogical and Mathematical Reasoning in Large Language and Reasoning Models
Giacomo Camposampiero, Michael Hersche, Roger Wattenhofer, Abu Sebastian, Abbas Rahimi

TL;DR
I-RAVEN-X is a new benchmark that assesses the generalization and robustness of large language and reasoning models in analogical and mathematical reasoning, highlighting their strengths and current limitations.
Contribution
I-RAVEN-X extends the I-RAVEN benchmark with higher operand complexity, wider attribute ranges, and perceptual uncertainty, and evaluates LLMs and LRMs under these conditions.
Findings
LRMs outperform LLMs in productivity (longer reasoning relations) and systematicity (wider attribute ranges).
LRMs struggle with reasoning under uncertainty.
Models have difficulty exploring multiple probabilistic outcomes.
Abstract
We introduce I-RAVEN-X, a symbolic benchmark designed to evaluate generalization and robustness in analogical and mathematical reasoning for Large Language Models (LLMs) and Large Reasoning Models (LRMs). I-RAVEN-X extends I-RAVEN by increasing operand complexity and attribute range and by introducing perceptual uncertainty. Empirical results show that, compared to LLMs, LRMs achieve improved productivity on longer reasoning relations and improved systematicity on wider attribute ranges. However, LRMs are still significantly challenged by reasoning under uncertainty and cannot effectively explore multiple probabilistic outcomes.
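To make the three extension axes concrete, here is a minimal Python sketch of how a single symbolic row could be parameterized by relation length, attribute range, and perceptual uncertainty. It is not the benchmark's actual generator: the function names, the constant-step progression rule, the uniform noise model, and all numeric settings are assumptions chosen for illustration.

```python
import random


def make_progression_row(length, attr_range, rng):
    """Build one row of attribute values with a constant step between entries.

    `length` stands in for the number of operands in the relation and
    `attr_range` for the number of admissible attribute values. Both names
    and the set of step sizes are hypothetical.
    """
    step = rng.choice([-2, -1, 1, 2])
    span = abs(step) * (length - 1)
    if span >= attr_range:
        raise ValueError("attr_range is too small for this row length")
    # Choose a start so that every value stays inside [0, attr_range).
    if step < 0:
        start = rng.randint(span, attr_range - 1)
    else:
        start = rng.randint(0, attr_range - 1 - span)
    return [start + i * step for i in range(length)]


def smooth_value(value, attr_range, p_true):
    """Replace a crisp value with a probability distribution over all values.

    The true value keeps probability `p_true`; the rest is spread uniformly
    over the other values. This noise model is an assumption, not the
    benchmark's definition of perceptual uncertainty.
    """
    if attr_range < 2:
        raise ValueError("attr_range must be at least 2")
    rest = (1.0 - p_true) / (attr_range - 1)
    return [p_true if v == value else rest for v in range(attr_range)]


# Arbitrary example settings: a longer row over a wider attribute range.
rng = random.Random(0)
row = make_progression_row(length=10, attr_range=1000, rng=rng)
context, target = row[:-1], row[-1]
# The solver would see distributions instead of crisp context values.
noisy_context = [smooth_value(v, 1000, p_true=0.7) for v in context]
```

In this sketch, raising `length` probes longer relations, raising `attr_range` probes wider value ranges, and lowering `p_true` makes the context less certain, which loosely mirrors the three axes named in the abstract.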
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Computational and Text Analysis Methods
