Zero-shot Interactive Perception
Venkatesh Sripada, Frank Guerin, Amir Ghalamzan

TL;DR
Zero-Shot Interactive Perception (ZS-IP) introduces a framework combining multi-strategy manipulation and a memory-driven Vision Language Model to improve robotic perception and interaction in complex, occluded environments.
Contribution
The paper presents ZS-IP, a novel framework that integrates pushlines and memory-guided reasoning to enhance robotic perception and manipulation without task-specific training.
Findings
ZS-IP outperforms passive perception methods in pushing tasks.
Pushlines improve affordance detection for contact-rich actions.
ZS-IP preserves the integrity of non-target elements during interactions.
Abstract
Interactive perception (IP) enables robots to extract hidden information in their workspace and execute manipulation plans by physically interacting with objects and altering the state of the environment. This capability is crucial for resolving occlusions and ambiguity in complex, partially observable scenarios. We present Zero-Shot IP (ZS-IP), a novel framework that couples multi-strategy manipulation (pushing and grasping) with a memory-driven Vision Language Model (VLM) to guide robotic interactions and resolve semantic queries. ZS-IP integrates three key components: (1) an Enhanced Observation (EO) module that augments the VLM's visual perception with both conventional keypoints and our proposed pushlines (a novel 2D visual augmentation tailored to pushing actions), (2) a memory-guided action module that reinforces semantic reasoning through context lookup, and (3) a robotic controller that…
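The abstract describes ZS-IP as three cooperating components, but this excerpt gives no interfaces, and the description of the controller is cut off. As a rough illustration only, the Python sketch below shows one way to wire a perceive-act-remember loop of this shape. Everything in it is hypothetical. `Action`, `Memory`, `enhance_observation`, and `interactive_perception` are invented names. The VLM and the robot controller are passed in as plain callables. "Context lookup" is approximated by returning the most recent interaction notes. This is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    kind: str                      # hypothetical action set: "push", "grasp", or "answer"
    target: Optional[str] = None   # object the push/grasp is applied to
    answer: Optional[str] = None   # filled only when kind == "answer"


@dataclass
class Memory:
    entries: List[str] = field(default_factory=list)

    def add(self, note: str) -> None:
        self.entries.append(note)

    def context(self, k: int = 5) -> str:
        # Assumed "context lookup": return the k most recent interaction notes.
        return "\n".join(self.entries[-k:])


def enhance_observation(frame: str) -> str:
    # Stand-in for the EO module: a real system would draw keypoints and
    # pushlines onto the image; here we only tag the frame identifier.
    return f"{frame}+keypoints+pushlines"


def interactive_perception(
    query: str,
    observe: Callable[[], str],
    vlm: Callable[[str, str, str], Action],
    execute: Callable[[Action], None],
    max_steps: int = 10,
) -> Tuple[Optional[str], Memory]:
    memory = Memory()
    for step in range(max_steps):
        obs = enhance_observation(observe())
        # The VLM sees the query, the augmented observation, and past interactions.
        action = vlm(query, obs, memory.context())
        if action.kind == "answer":
            return action.answer, memory
        execute(action)  # hand the push/grasp to the (stubbed) robot controller
        memory.add(f"step {step}: {action.kind} {action.target}")
    return None, memory  # step budget exhausted without an answer


if __name__ == "__main__":
    # Scripted stand-ins for the VLM and camera, so the loop runs without hardware.
    script = iter([Action("push", "box"), Action("grasp", "lid"),
                   Action("answer", answer="red block")])
    frames = iter(["frame0", "frame1", "frame2"])
    answer, mem = interactive_perception(
        "What is under the lid?",
        observe=lambda: next(frames),
        vlm=lambda q, obs, ctx: next(script),
        execute=lambda a: None,
    )
    print(answer)
    print(mem.entries)
```

With these scripted stand-ins, the loop performs a push and then a grasp, records both in memory, and returns `"red block"` on the third step. The key design choice is passing the memory context to the VLM on every call, because that is what lets later decisions depend on what earlier interactions revealed.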
Taxonomy
Topics: Robot Manipulation and Learning · Multimodal Machine Learning Applications · Social Robot Interaction and HRI
