BOP-ASK: Object-Interaction Reasoning for Vision-Language Models

Vineet Bhat; Sungsu Kim; Valts Blukis; Greg Heinrich; Prashanth Krishnamurthy; Ramesh Karri; Stan Birchfield; Farshad Khorrami; Jonathan Tremblay

arXiv:2511.16857 · cs.CV · April 21, 2026

TL;DR

BOP-ASK is a large-scale dataset designed to improve object interaction reasoning in vision-language models, emphasizing fine-grained spatial understanding and physical reasoning.

Contribution

The paper introduces BOP-ASK, a novel dataset with detailed annotations for object interaction reasoning, enabling better training and benchmarking of VLMs.

Findings

01. Models trained on BOP-ASK outperform baselines.

02. Trained models show emergent capabilities like pose estimation and trajectory planning.

03. BOP-ASK enables testing of generalization with out-of-distribution data.

Abstract

Vision-Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning, intended for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object…
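
To make the idea of deriving relative spatial and depth relationships from 6D poses concrete, here is a minimal sketch. It is not the authors' pipeline. It assumes each object's translation is given in the camera frame (x to the right, y down, z forward) in millimetres, and the names `ObjectPose`, `relative_relations`, and `margin_mm` are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class ObjectPose:
    """Hypothetical container: only the translation part of a 6D pose is used here."""
    name: str
    # Object origin in the camera frame (assumed: x right, y down, z forward), in mm.
    t: tuple


def relative_relations(a, b, margin_mm=10.0):
    """Return coarse relations of `a` with respect to `b`, as seen from the camera.

    `margin_mm` is an assumed dead zone so that near-equal coordinates
    produce no relation instead of a noisy one.
    """
    relations = []
    dx = a.t[0] - b.t[0]  # negative means `a` is further left in the image
    dz = a.t[2] - b.t[2]  # negative means `a` is nearer to the camera
    if dx < -margin_mm:
        relations.append("left of")
    elif dx > margin_mm:
        relations.append("right of")
    if dz < -margin_mm:
        relations.append("closer than")
    elif dz > margin_mm:
        relations.append("farther than")
    return relations


# Made-up example poses, not taken from any BOP scene.
mug = ObjectPose("mug", (-120.0, 30.0, 650.0))
bowl = ObjectPose("bowl", (80.0, 25.0, 900.0))
print(relative_relations(mug, bowl))  # ['left of', 'closer than']
```

A relation list like this could then be turned into a question and answer pair with a text template. Whether the dataset uses thresholds of this kind is not stated in the abstract.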
