Few-Shot Visual Grounding for Natural Human-Robot Interaction
Giorgos Tziafas, Hamidreza Kasaei

TL;DR
This paper introduces a novel single-stage zero-shot deep neural network for visual grounding in human-robot interaction, enabling robots to understand verbal references to objects in crowded scenes without prior training on specific objects.
Contribution
The paper presents a single-stage zero-shot visual grounding model that, unlike most grounding methods, does not depend on a pre-trained object detector in a two-step process and can make predictions on unseen data, supporting real-time understanding in dynamic environments.
Findings
High accuracy and speed on real RGB-D data
Robustness to natural language variation
Effective in crowded scenes
Abstract
Natural Human-Robot Interaction (HRI) is one of the key components for service robots to be able to work in human-centric environments. In such dynamic environments, the robot needs to understand the intention of the user to accomplish a task successfully. Towards addressing this point, we propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. Unlike most grounding methods that tackle the challenge using pre-trained object detectors via a two-stepped process, we develop a single stage zero-shot model that is able to provide predictions in unseen data. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets. Experimental results showed that the proposed model performs well in terms…
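To illustrate what "single stage" means here, the following is a toy sketch, not the authors' architecture. It assumes a hypothetical visual feature map and a hypothetical embedding of the verbal reference that share one feature dimension, and it scores every grid cell directly against the query rather than ranking proposals from a pre-trained detector. The function name, shapes, and cosine-similarity scoring are all assumptions made for illustration.

```python
import numpy as np


def ground_single_stage(feature_map, text_embedding):
    """Toy single-stage grounding: score every grid cell against the query.

    feature_map: (H, W, C) visual features (hypothetical, e.g. from a CNN backbone).
    text_embedding: (C,) embedding of the verbal reference (hypothetical).
    Returns the (row, col) of the best-matching cell and the (H, W) score map.
    """
    # L2-normalise both modalities so the dot product is a cosine similarity.
    v = feature_map / (np.linalg.norm(feature_map, axis=-1, keepdims=True) + 1e-8)
    t = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)

    # One pass over the whole grid: no region proposals, no separate ranking step.
    scores = v @ t  # shape (H, W)

    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    return (int(row), int(col)), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fmap = rng.normal(size=(4, 5, 16))   # stand-in for image features
    query = rng.normal(size=16)          # stand-in for the sentence embedding
    fmap[2, 3] = 3.0 * query             # plant a cell that matches the query
    cell, _ = ground_single_stage(fmap, query)
    print("best cell:", cell)
```

A two-step pipeline would instead run a detector first and then pick among its boxes, so an object class the detector never learned could not be returned. Scoring the full grid avoids that dependency, which is the property the abstract attributes to the single-stage design.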
