GIM: Evaluating models via tasks that integrate multiple cognitive domains
Rohit Patel, Alexandre Rezende, Steven McClain

TL;DR
The GIM benchmark evaluates language models on complex tasks that integrate multiple cognitive operations in realistic contexts, using a calibrated IRT model to produce robust ability estimates and to analyze the effects of test-time compute.
Contribution
This paper introduces GIM, a new benchmark with 820 problems emphasizing integrated reasoning, and applies a calibrated IRT model for reliable ability measurement across diverse models.
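For readers unfamiliar with item response theory (IRT), the idea behind an ability estimate can be sketched as follows. This is a minimal illustration that assumes a two-parameter logistic (2PL) model and a simple grid search; the summary above does not say which IRT variant or fitting procedure the paper uses, and the function names and item parameters here are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function (an assumed model, not confirmed by the paper):
    probability that a solver with ability theta answers an item with
    discrimination a and difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def log_likelihood(theta, responses, items):
    """Log-likelihood of a 0/1 response vector given fixed item parameters."""
    total = 0.0
    for y, (a, b) in zip(responses, items):
        p = p_correct(theta, a, b)
        total += math.log(p) if y else math.log(1.0 - p)
    return total

def estimate_ability(responses, items, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood ability estimate via grid search over [lo, hi]."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, responses, items))

# Hypothetical calibrated items as (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
strong = estimate_ability([1, 1, 0], items)  # solves the two easier items
weak = estimate_ability([1, 0, 0], items)    # solves only the easiest item
```

In this sketch the item parameters are treated as already calibrated, so only the ability `theta` is estimated per model. A response vector that is all correct or all incorrect has no finite maximum-likelihood estimate and would simply land on the edge of the grid.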
Findings
GIM provides a balanced public-private problem set (615 public, 205 private) for contamination diagnostics.
A calibrated IRT model accurately orders models despite accuracy distortions.
Test-time compute significantly influences model performance within model families.
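One simple way a public-private split can support contamination diagnostics is to compare a model's accuracy on the two splits. The sketch below is an assumption about how such a check might look, not the paper's procedure; the function name and the toy records are hypothetical.

```python
def split_accuracy_gap(results):
    """results: iterable of (split, correct) pairs, where split is
    'public' or 'private' and correct is a bool.
    Returns (public_accuracy, private_accuracy, gap)."""
    counts = {"public": [0, 0], "private": [0, 0]}  # [num_correct, num_total]
    for split, correct in results:
        counts[split][0] += int(correct)
        counts[split][1] += 1
    public_acc = counts["public"][0] / counts["public"][1]
    private_acc = counts["private"][0] / counts["private"][1]
    # A large positive gap would be consistent with exposure to the
    # public items, though it is not proof of contamination on its own.
    return public_acc, private_acc, public_acc - private_acc

# Toy records for illustration only (not data from the paper).
records = [("public", True), ("public", True), ("public", False), ("public", True),
           ("private", True), ("private", False)]
public_acc, private_acc, gap = split_accuracy_gap(records)
```

A gap near zero is what one would expect if the splits are comparable in difficulty and the model has not seen the public items; the sketch does not account for sampling noise, which matters with a private split of 205 problems.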
Abstract
As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with…
