VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning
Thomas Carta, Subhajit Chaudhury, Kartik Talamadupula, Michiaki Tatsubori

TL;DR
VisualHints introduces a new multimodal reinforcement learning environment combining visual clues and natural language interactions, aiming to bridge the gap between vision-based and language-based RL tasks for more realistic problem-solving.
Contribution
We developed VisualHints, a unified environment that integrates visual and textual modalities for RL, extending TextWorld with visual clues to better emulate real-world scenarios.
Findings
A baseline multimodal agent makes effective use of both visual and textual features.
Environment variations increase task complexity and realism.
Potential to foster new research in multimodal RL.
Abstract
We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints (obtained from the environment). Real-life problems often demand that agents interact with the environment using both natural language information and visual perception to achieve a goal. However, most traditional RL environments either address purely vision-based tasks, like Atari games or video-based robotic manipulation, or rely entirely on natural language as the mode of interaction, like text-based games and dialog systems. In this work, we aim to bridge this gap and unify these two approaches in a single environment for multimodal RL. We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment. The goal is to force an RL agent to use both text and visual features to…
Taxonomy
Topics: Multimodal Machine Learning Applications
