Explore and Tell: Embodied Visual Captioning in 3D Environments

Anwen Hu; Shizhe Chen; Liang Zhang; Qin Jin

arXiv:2308.10447·cs.CV·August 22, 2023

TL;DR

This paper introduces Embodied Captioning, a task in which an agent navigates a 3D environment to produce a detailed description of the scene. The task is supported by a new dataset (ET-Cap) and a new model (CaBOT) that outperforms baselines.

Contribution

The paper proposes the Embodied Captioning task, builds the ET-Cap dataset, and develops CaBOT, a model that combines navigation and captioning for comprehensive scene understanding.

Findings

01 CaBOT outperforms baseline models in descriptive accuracy

02 The ET-Cap dataset contains 10K annotated 3D scenes

03 Navigation improves caption quality in complex environments

Abstract

While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated…
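The explore-then-describe loop in the abstract can be illustrated with a toy sketch. This is not the CaBOT model and not the ET-Cap or Kubric API: `ToyScene`, `explore`, and `describe` are hypothetical names, the scene is a hand-written graph of viewpoints, the greedy rule stands in for a learned navigator (and peeks at what each neighbouring viewpoint would reveal, which a real agent cannot do), and the template sentence stands in for a learned captioner.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToyScene:
    """Hypothetical toy scene: a graph of viewpoints (not an ET-Cap scene)."""
    visible: dict[int, set[str]]     # viewpoint -> objects visible from it
    neighbors: dict[int, list[int]]  # viewpoint -> viewpoints reachable in one move


def explore(scene: ToyScene, start: int, max_steps: int = 5) -> tuple[list[int], set[str]]:
    """Greedy stand-in for a navigator.

    Moves to the neighbour that reveals the most unseen objects and stops
    when no neighbour reveals anything new or max_steps is reached.
    """
    seen = set(scene.visible[start])
    path = [start]
    current = start
    for _ in range(max_steps):
        best, best_gain = None, 0
        for nxt in scene.neighbors[current]:
            gain = len(scene.visible[nxt] - seen)
            if gain > best_gain:
                best, best_gain = nxt, gain
        if best is None:  # nothing new to see from any neighbour
            break
        current = best
        path.append(current)
        seen |= scene.visible[current]
    return path, seen


def describe(seen: set[str]) -> str:
    """Template stand-in for a captioner: lists every object gathered so far."""
    if not seen:
        return "The scene appears empty."
    return "The scene contains " + ", ".join(sorted(seen)) + "."


# A poor starting viewpoint (0) shows only one of the three objects.
scene = ToyScene(
    visible={0: {"mug"}, 1: {"mug", "book"}, 2: {"book", "lamp"}},
    neighbors={0: [1], 1: [0, 2], 2: [1]},
)
path, seen = explore(scene, start=0)
caption = describe(seen)
```

Captioning from the starting viewpoint alone would mention only the mug; after exploring, the description covers all three objects, which is the gap the task is meant to close.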

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques