VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning

Thomas Carta; Subhajit Chaudhury; Kartik Talamadupula; Michiaki Tatsubori

arXiv:2010.13839·cs.LG·October 28, 2020·1 cites

Open Access

TL;DR

VisualHints is a new multimodal reinforcement learning environment that combines visual clues with natural language interaction, aiming to bridge the gap between vision-based and language-based RL tasks for more realistic problem-solving.

Contribution

We developed VisualHints, a unified environment that integrates visual and textual modalities for RL, extending TextWorld with visual clues to better emulate real-world scenarios.

Findings

01 A baseline multimodal agent demonstrates effective use of both visual and textual features.

02 Variations of the environment increase task complexity and realism.

03 The environment has the potential to foster new research in multimodal RL.

Abstract

We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints (obtained from the environment). Real-life problems often demand that agents interact with the environment using both natural language information and visual perception to achieve a goal. However, most traditional RL environments either pose purely vision-based tasks, such as Atari games or video-based robotic manipulation, or rely entirely on natural language as the mode of interaction, as in text-based games and dialog systems. In this work, we aim to bridge this gap and unify these two approaches in a single environment for multimodal RL. We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment. The goal is to force an RL agent to use both text and visual features to…
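As a rough illustration of what "using both text and visual features" can mean for an agent, the sketch below fuses a textual observation and a small image hint into one feature vector. It is a minimal sketch under stated assumptions, not the VisualHints or TextWorld API: the `MultimodalObservation` class, the helper functions, the toy vocabulary, and the 2x2 grayscale hint are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MultimodalObservation:
    """Hypothetical container for one step of a visual-lingual game."""
    text: str              # textual feedback from the game
    hint: List[List[int]]  # tiny grayscale image, pixel values in 0..255


def text_features(text: str, vocab: List[str]) -> List[float]:
    """Bag-of-words counts over a fixed (assumed) vocabulary."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocab]


def image_features(hint: List[List[int]]) -> List[float]:
    """Flatten the image row by row and scale pixels to [0, 1]."""
    return [pixel / 255.0 for row in hint for pixel in row]


def fuse(obs: MultimodalObservation, vocab: List[str]) -> List[float]:
    """Concatenate text and image features into a single vector."""
    return text_features(obs.text, vocab) + image_features(obs.hint)


# Toy example: values are made up for illustration only.
vocab = ["kitchen", "fridge", "garden"]
obs = MultimodalObservation(
    text="You are in the kitchen . The kitchen has a fridge",
    hint=[[0, 255], [255, 0]],
)
features = fuse(obs, vocab)
```

Concatenation is the simplest possible fusion choice and is used here only to make the idea concrete; a learned agent would typically replace both feature extractors with trained encoders.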

Taxonomy

Topics: Multimodal Machine Learning Applications