InViG: Benchmarking Interactive Visual Grounding with 500K Human-Robot Interactions

Hanbo Zhang; Jie Xu; Yuchen Mo; Tao Kong

arXiv:2310.12147 · cs.RO · October 19, 2023 · 1 citation

TL;DR

This paper introduces InViG, a large-scale dataset of over 520K images accompanied by open-ended, goal-oriented disambiguation dialogues, for benchmarking interactive visual grounding under language ambiguity in human-robot interaction.

Contribution

To the best of the authors' knowledge, InViG is the first large-scale dataset for open-ended interactive visual grounding. The paper also proposes a set of baseline solutions, advancing research in ambiguity-aware human-robot interaction.

Findings

01

The proposed baseline solutions achieved a 45.6% success rate during validation.

02

Created a dataset of over 520K images, encompassing millions of object instances and corresponding question-answer pairs.

03

Established baseline methods for end-to-end interactive visual disambiguation and grounding.

Abstract

Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present a large-scale dataset, InViG, for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the InViG dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6% success rate during validation. To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding,…

Code & Models

Repositories

zhanghanbo/invig-dataset

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Sparse Evolutionary Training