SInViG: A Self-Evolving Interactive Visual Agent for Human-Robot Interaction

Jie Xu; Hanbo Zhang; Xinghang Li; Huaping Liu; Xuguang Lan; Tao Kong

arXiv:2402.11792 · cs.RO · February 21, 2024 · 1 citation

TL;DR

SInViG is a self-evolving visual agent that improves human-robot interaction by learning from unlabeled data and large language models, enabling natural multi-turn dialogues and robust performance in complex environments.

Contribution

Introduces SInViG, a novel self-evolving visual agent that enhances human-robot interaction through autonomous learning and multi-turn visual-language dialogue capabilities.

Findings

01. Sets a new state of the art on interactive visual grounding benchmarks.

02. Demonstrates improved human preference acquisition over time.

03. Successfully deployed on a robot for natural language interactive manipulation.

Abstract

Linguistic ambiguity is ubiquitous in daily life. Previous work has used interaction between robots and humans to disambiguate language. Nevertheless, when interactive robots are deployed in everyday environments, natural human-robot interaction faces significant challenges stemming from complex and unpredictable visual inputs, open-ended interaction, and diverse user demands. In this paper, we present SInViG, a self-evolving interactive visual agent for natural-language human-robot interaction that aims to resolve language ambiguity, if any, through multi-turn visual-language dialogues. It continuously and automatically learns from unlabeled images and large language models, without human intervention, to become more robust against visual and linguistic complexity. Benefiting from self-evolving, it sets new state-of-the-art on several interactive visual…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Human Pose and Action Recognition