TL;DR
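The paper introduces SCBench, a benchmark that evaluates spatial competence in large models across three hierarchical capability buckets. Three frontier models lose accuracy at each step up the capability ladder, and their failures are dominated by locally plausible geometry that breaks global constraints.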
Contribution
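It presents a hierarchical spatial competence benchmark whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators, and it releases the task generators, verifiers, and visualization tooling. This addresses a limitation of existing spatial evaluations, which probe isolated primitives through 3D transformations or visual question answering.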
Findings
Three frontier models exhibit monotonically decreasing accuracy up the capability ladder.
Accuracy gains from larger output-token caps concentrate at low budgets and saturate quickly.
Failures are dominated by locally plausible geometry that breaks global constraints.
Abstract
Spatial competence is the quality of maintaining a consistent internal representation of an environment and using it to infer discrete structure and plan actions under constraints. Prevailing spatial evaluations for large models are limited to probing isolated primitives through 3D transformations or visual question answering. We introduce the Spatial Competence Benchmark (SCBench), spanning three hierarchical capability buckets whose tasks require executable outputs verified by deterministic checkers or simulator-based evaluators. On SCBench, three frontier models exhibit monotonically decreasing accuracy up the capability ladder. Sweeping output-token caps shows that accuracy gains concentrate at low budgets and saturate quickly, and failures are dominated by locally plausible geometry that breaks global constraints. We release the task generators, verifiers, and visualisation tooling.
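To make the idea of a deterministic checker and the "locally plausible, globally invalid" failure mode concrete, here is a minimal sketch in Python. It is not taken from SCBench: the grid path task, the function names `locally_valid` and `globally_valid`, and the wall layout are all hypothetical, chosen only to illustrate how an output can pass every step-by-step check yet still violate a constraint that applies to the whole structure.

```python
from typing import List, Set, Tuple

Cell = Tuple[int, int]  # (x, y) grid coordinate; hypothetical task encoding


def locally_valid(path: List[Cell]) -> bool:
    """Local check: every step moves exactly one cell horizontally or vertically."""
    return all(
        abs(x1 - x0) + abs(y1 - y0) == 1
        for (x0, y0), (x1, y1) in zip(path, path[1:])
    )


def globally_valid(path: List[Cell], walls: Set[Cell], goal: Cell) -> bool:
    """Global check: the path is also non-empty, ends at the goal,
    never revisits a cell, and never enters a wall cell."""
    return (
        locally_valid(path)
        and len(path) > 0
        and path[-1] == goal
        and len(set(path)) == len(path)
        and not walls.intersection(path)
    )


# Illustrative instance (assumed, not from the benchmark).
walls = {(1, 0), (1, 1)}
goal = (2, 0)

# Every step is a legal unit move, but the path walks through a wall.
cuts_through_wall = [(0, 0), (1, 0), (2, 0)]

# Longer route around the wall that satisfies all constraints.
detour = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]

print(locally_valid(cuts_through_wall), globally_valid(cuts_through_wall, walls, goal))
print(locally_valid(detour), globally_valid(detour, walls, goal))
```

Because the checker is a pure function of the submitted path and the task instance, the same output always receives the same verdict, which is the property a deterministic verifier relies on.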
