PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos

Zhiyu Zhou; Peilin Liu; Ruoxuan Zhang; Luyang Zhang; Cheng Zhang; Hongxia Xie; Wen-Huang Cheng

arXiv:2604.08991 · cs.CV · May 15, 2026

TL;DR

PinpointQA is a new dataset and benchmark for evaluating how well multimodal large language models (MLLMs) localize small objects and describe their positions in indoor videos. It exposes the limitations of current models and provides training data that improves them.

Contribution

PinpointQA provides the first dataset and benchmark designed specifically for small object-centric spatial understanding in indoor videos, organized into four tasks of increasing difficulty.

Findings

01. MLLMs show significant performance gaps on PinpointQA tasks.

02. Supervised fine-tuning improves model performance, especially on the harder tasks.

03. The SSP task remains particularly challenging for current models.

Abstract

Small object-centric spatial understanding in indoor videos remains a significant challenge for multimodal large language models (MLLMs), despite its practical value for object search and assistive applications. Although existing benchmarks have advanced video spatial intelligence, embodied reasoning, and diagnostic perception, no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use. In this work, we introduce PinpointQA, the first dataset and benchmark for small object-centric spatial understanding in indoor videos. Built from ScanNet++ and ScanNet200, PinpointQA comprises 1,024 scenes and 10,094 QA pairs organized into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and…
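
As a rough illustration of how the QA pairs described in the abstract might be grouped by task, the following Python sketch defines a hypothetical record type and tallies pairs per task. The field names (`scene_id`, `task`, `question`, `answer`), the `count_by_task` helper, and the toy records are assumptions made for illustration, not the dataset's actual schema. The fourth task is assumed to be the SSP task named under Findings.

```python
from collections import Counter
from dataclasses import dataclass

# Task abbreviations as they appear on this page. The expansion of SSP is
# not given here, so it is kept as an abbreviation only.
TASKS = ("TPV", "NRI", "FSD", "SSP")


@dataclass(frozen=True)
class QAPair:
    """Hypothetical record layout; field names are illustrative assumptions."""
    scene_id: str
    task: str
    question: str
    answer: str


def count_by_task(pairs):
    """Return the number of QA pairs per task, in the order listed in TASKS."""
    counts = Counter(p.task for p in pairs)
    return {task: counts.get(task, 0) for task in TASKS}


# Made-up toy records, not taken from the dataset.
toy = [
    QAPair("scene_a", "TPV", "Is there a mug in the video?", "yes"),
    QAPair("scene_a", "NRI", "Which object is closest to the mug?", "kettle"),
    QAPair("scene_b", "TPV", "Is there a remote in the video?", "no"),
]
print(count_by_task(toy))
```

Keeping the task label as a plain field makes it straightforward to report results separately for each of the four tasks, which is how the findings above are framed.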

Code & Models

Repositories

https://rainchowz.github.io/PinpointQA
github

Models

Datasets

RainChow/PinpointQA
dataset · 376 downloads
