BOP-ASK: Object-Interaction Reasoning for Vision-Language Models
Vineet Bhat, Sungsu Kim, Valts Blukis, Greg Heinrich, Prashanth Krishnamurthy, Ramesh Karri, Stan Birchfield, Farshad Khorrami, Jonathan Tremblay

TL;DR
BOP-ASK is a large-scale dataset designed to improve object interaction reasoning in vision-language models, emphasizing fine-grained spatial understanding and physical reasoning.
Contribution
The paper introduces BOP-ASK, a novel dataset with detailed annotations for object interaction reasoning, enabling better training and benchmarking of VLMs.
Findings
Models trained on BOP-ASK outperform baselines.
Trained models show emergent capabilities like pose estimation and trajectory planning.
BOP-ASK enables testing of generalization with out-of-distribution data.
Abstract
Vision Language Models (VLMs) have achieved impressive performance on spatial reasoning benchmarks, yet these evaluations mask critical weaknesses in understanding object interactions. Current benchmarks test high-level relationships ('left of', 'behind', etc.) but ignore the fine-grained spatial understanding needed for real-world applications: precise 3D localization, physical compatibility between objects, object affordances, and multi-step spatial planning. In this work, we present BOP-ASK, a novel large-scale dataset for object interaction reasoning for both training and benchmarking. Our data generation pipeline leverages 6D object poses from the Benchmark for Object Pose Estimation (BOP) datasets, from which we derive fine-grained annotations such as grasp poses, referred object poses, path planning trajectories, relative spatial and depth relationships, and object-to-object…
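To make the idea of deriving relative spatial and depth relationships from 6D poses concrete, here is a minimal sketch. It is not the authors' pipeline. It assumes object translations are expressed in the camera frame with x pointing right and z pointing forward, in millimetres, and every name in it (`ObjectPose`, `depth_relation`, `lateral_relation`, `make_qa`, the 10 mm tolerance) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ObjectPose:
    """Hypothetical container: object name plus its translation in the camera frame."""
    name: str
    # Assumed convention: (x right, y down, z forward), in millimetres.
    t_cam_mm: Tuple[float, float, float]


def depth_relation(a: ObjectPose, b: ObjectPose, tol_mm: float = 10.0) -> str:
    """Compare two objects along the camera's forward (z) axis."""
    dz = a.t_cam_mm[2] - b.t_cam_mm[2]
    if abs(dz) <= tol_mm:
        return "at a similar depth to"
    return "closer than" if dz < 0 else "farther than"


def lateral_relation(a: ObjectPose, b: ObjectPose, tol_mm: float = 10.0) -> str:
    """Compare two objects along the camera's horizontal (x) axis."""
    dx = a.t_cam_mm[0] - b.t_cam_mm[0]
    if abs(dx) <= tol_mm:
        return "horizontally aligned with"
    return "to the left of" if dx < 0 else "to the right of"


def make_qa(a: ObjectPose, b: ObjectPose) -> Tuple[str, str]:
    """Turn a pair of poses into one illustrative question/answer pair."""
    question = (
        f"From the camera's viewpoint, where is the {a.name} "
        f"relative to the {b.name}?"
    )
    answer = (
        f"The {a.name} is {lateral_relation(a, b)} "
        f"and {depth_relation(a, b)} the {b.name}."
    )
    return question, answer


# Illustrative values only; not taken from any BOP scene.
mug = ObjectPose("mug", (-120.0, 10.0, 650.0))
box = ObjectPose("box", (80.0, -5.0, 900.0))
question, answer = make_qa(mug, box)
print(question)
print(answer)
```

The sketch uses only the translation part of each pose. Annotations such as grasp poses or planned trajectories would also need the rotation and the object geometry, which are outside what this example covers.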
