GIM: Evaluating models via tasks that integrate multiple cognitive domains

Rohit Patel; Alexandre Rezende; Steven McClain

arXiv:2605.18663·cs.AI·May 19, 2026

TL;DR

The GIM benchmark evaluates language models on complex tasks that integrate multiple cognitive operations in realistic contexts. It uses a calibrated IRT model to produce robust ability estimates and to analyze the effect of test-time compute.

Contribution

This paper introduces GIM, a benchmark of 820 original problems (615 public, 205 private) whose difficulty comes from integrated reasoning, and applies a calibrated item response theory (IRT) model for reliable ability measurement across diverse models.
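To make the idea of IRT-based ability measurement concrete, here is a minimal sketch in Python. It assumes a two-parameter logistic (2PL) model, which this page does not confirm as the paper's choice, and the item parameters and response patterns are made up for illustration. The names `p_correct`, `score`, and `estimate_ability` are hypothetical helpers, not part of any GIM release.

```python
import math

def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer.

    theta: model ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def score(theta, items, responses):
    """Derivative of the log-likelihood with respect to theta."""
    return sum(a * (x - p_correct(theta, a, b))
               for (a, b), x in zip(items, responses))

def estimate_ability(items, responses, lo=-6.0, hi=6.0, tol=1e-6):
    """Maximum-likelihood ability estimate by bisection.

    The score function is strictly decreasing in theta when every
    discrimination a is positive, so its root is the unique maximum.
    All-correct or all-wrong patterns have no finite maximum and are
    clamped to the search bounds.
    """
    if score(lo, items, responses) <= 0:
        return lo
    if score(hi, items, responses) >= 0:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if score(mid, items, responses) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical (discrimination, difficulty) pairs; not GIM's calibrated values.
items = [(0.5, -1.0), (1.0, 0.0), (1.5, 1.0), (2.0, 2.0)]

# Two hypothetical models with the same raw accuracy (2 of 4 correct).
model_a = [1, 1, 0, 0]  # solves the easy, weakly discriminating items
model_b = [0, 0, 1, 1]  # solves the hard, strongly discriminating items

theta_a = estimate_ability(items, model_a)
theta_b = estimate_ability(items, model_b)
print(f"model A ability: {theta_a:.3f}")
print(f"model B ability: {theta_b:.3f}")
```

Both hypothetical models answer half the items correctly, yet the estimate ranks model B higher because it solves the harder, more discriminating items. This is one way an IRT estimate can separate models that raw accuracy treats as equal.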

Findings

01

GIM provides a balanced public-private problem split (615 public, 205 private) for contamination diagnostics.

02

The calibrated IRT model accurately orders models despite distortions in accuracy.

03

Test-time compute significantly influences performance within model families.
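The public-private split in the first finding suggests a simple contamination check: compare a model's accuracy on the public problems with its accuracy on the private ones. The sketch below uses a one-sided two-proportion z-test for this. The page does not describe the paper's actual diagnostic, so the test is an assumption, and the correct-answer counts are invented. Only the split sizes (615 and 205) come from the source, and `accuracy_gap` is a hypothetical helper.

```python
import math

def accuracy_gap(public_correct, public_total, private_correct, private_total):
    """Compare public and private accuracy with a two-proportion z-test.

    Returns (gap, z, p_value), where gap is public minus private accuracy
    and p_value is one-sided for the hypothesis public > private.
    """
    p_pub = public_correct / public_total
    p_pri = private_correct / private_total
    # Pooled accuracy under the null hypothesis of no difference.
    pooled = (public_correct + private_correct) / (public_total + private_total)
    se = math.sqrt(pooled * (1.0 - pooled)
                   * (1.0 / public_total + 1.0 / private_total))
    z = (p_pub - p_pri) / se if se > 0 else 0.0
    # Upper-tail probability of the standard normal distribution.
    p_value = 0.5 * math.erfc(z / math.sqrt(2.0))
    return p_pub - p_pri, z, p_value

# Hypothetical counts of correct answers for one model.
gap, z, p_value = accuracy_gap(430, 615, 123, 205)
print(f"gap={gap:.3f}  z={z:.2f}  one-sided p={p_value:.4f}")
```

A large positive gap with a small p-value would be consistent with the model having seen the public problems, though differences in difficulty between the two sets could produce the same pattern.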

Abstract

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with…

Code & Models

Datasets

facebook/gim
dataset · 287 downloads