Spatial Reasoning from Natural Language Instructions for Robot Manipulation
Sagar Gubbi Venkatesh, Anirban Biswas, Raviteja Upadrashta, Vikram Srinivasan, Partha Talukdar, Bharadwaj Amrutur

TL;DR
This paper presents a two-stage spatial reasoning pipeline that lets a robot act on natural language pick-and-place instructions: it first localizes every object in the scene, then maps the instruction and the object locations to the coordinates where the robot must pick up and place the object.
Contribution
A two-stage pipelined architecture that localizes the objects in the scene and then maps the language instruction to pick and place coordinates. Object positions are quantized to a binary grid, and attention is used to improve generalization.
Findings
Quantizing localized object positions to a binary grid is preferable to representing them as a list of 2D coordinates.
Attention improves generalization and can overcome biases in the dataset.
The method is demonstrated on a robot arm that picks and places playing cards.
Abstract
Robots that can manipulate objects in unstructured environments and collaborate with humans can benefit immensely by understanding natural language. We propose a pipelined architecture of two stages to perform spatial reasoning on the text input. All the objects in the scene are first localized, and then the instruction for the robot in natural language and the localized co-ordinates are mapped to the start and end co-ordinates corresponding to the locations where the robot must pick up and place the object respectively. We show that representing the localized objects by quantizing their positions to a binary grid is preferable to representing them as a list of 2D co-ordinates. We also show that attention improves generalization and can overcome biases in the dataset. The proposed method is used to pick-and-place playing cards using a robot arm.
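To make the two representations in the abstract concrete, the following is a minimal sketch of quantizing a list of 2D object positions to a binary occupancy grid. The function name `quantize_to_grid`, the grid size, the normalized workspace extent, and the sample card positions are all assumptions chosen for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List, Sequence, Tuple


def quantize_to_grid(
    points: Sequence[Tuple[float, float]],
    grid_size: int = 8,      # assumed resolution, not taken from the paper
    extent: float = 1.0,     # assumed workspace side length (normalized units)
) -> List[List[int]]:
    """Mark each (x, y) position as a 1 in a grid_size x grid_size binary grid."""
    grid = [[0] * grid_size for _ in range(grid_size)]
    for x, y in points:
        # Scale to cell indices and clamp so points on the border stay in range.
        col = min(max(int(x / extent * grid_size), 0), grid_size - 1)
        row = min(max(int(y / extent * grid_size), 0), grid_size - 1)
        grid[row][col] = 1
    return grid


# Hypothetical localized card positions as a list of 2D coordinates.
cards = [(0.10, 0.20), (0.55, 0.80), (0.95, 0.05)]
grid = quantize_to_grid(cards, grid_size=4)

for row in grid:
    print("".join(str(cell) for cell in row))
```

The coordinate list grows with the number of objects and depends on their ordering, whereas the grid has a fixed shape and no ordering, at the cost of losing precision below one cell and merging objects that fall in the same cell.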
