Following Instructions by Imagining and Reaching Visual Goals
John Kanu, Eadom Dessalene, Xiaomin Lin, Cornelia Fermuller, Yiannis Aloimonos

TL;DR
This paper introduces a reinforcement learning framework in which a robot performs temporally extended tasks by sequentially imagining visual goals and reasoning spatially, working from raw pixel inputs without prior linguistic or perceptual knowledge.
Contribution
The work presents an approach to instruction following in RL based on visual goal imagination and spatial reasoning. It operates directly on raw images and assumes no prior linguistic or perceptual knowledge.
Findings
The method outperforms flat architectures that use raw-pixel or ground-truth states.
It also outperforms a hierarchical architecture that uses ground-truth states on object arrangement tasks.
It is validated in two environments with a robot arm in a simulated interactive 3D environment.
Abstract
While traditional methods for instruction-following typically assume prior linguistic and perceptual knowledge, many recent works in reinforcement learning (RL) have proposed learning policies end-to-end, typically by training neural networks to map joint representations of observations and instructions directly to actions. In this work, we present a novel framework for learning to perform temporally extended tasks using spatial reasoning in the RL framework, by sequentially imagining visual goals and choosing appropriate actions to fulfill imagined goals. Our framework operates on raw pixel images, assumes no prior linguistic or perceptual knowledge, and learns via intrinsic motivation and a single extrinsic reward signal measuring task completion. We validate our method in two environments with a robot arm in a simulated interactive 3D environment. Our method outperforms two flat…
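The control flow the abstract describes (imagine a visual goal, act until it is reached, repeat, then score task completion once) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: nothing here is learned, the "images" are small NumPy arrays, and every name (`imagine_goal`, `low_level_action`, `intrinsic_reward`, `run_episode`), the distance-based intrinsic reward, and the direct-update dynamics are assumptions introduced for the example.

```python
import numpy as np

def intrinsic_reward(obs, goal):
    """Assumed form: negative pixel-space distance to the imagined goal."""
    return -float(np.linalg.norm(obs - goal))

def imagine_goal(obs, instruction_target, step_fraction=0.5):
    """Hypothetical high-level step: propose a goal image partway
    between the current observation and the instructed end state."""
    return obs + step_fraction * (instruction_target - obs)

def low_level_action(obs, goal, max_step=0.1):
    """Hypothetical low-level controller: move toward the goal,
    with each pixel change clipped to max_step."""
    return np.clip(goal - obs, -max_step, max_step)

def run_episode(obs, instruction_target, n_goals=8, steps_per_goal=10, tol=0.05):
    obs = obs.copy()
    intrinsic_total = 0.0
    for _ in range(n_goals):
        # High level: imagine the next visual sub-goal.
        goal = imagine_goal(obs, instruction_target)
        for _ in range(steps_per_goal):
            # Toy dynamics (assumption): the action is added to the image directly.
            obs = obs + low_level_action(obs, goal)
            intrinsic_total += intrinsic_reward(obs, goal)
    # Single extrinsic signal: was the instructed end state reached?
    extrinsic = 1.0 if np.linalg.norm(obs - instruction_target) < tol else 0.0
    return obs, intrinsic_total, extrinsic

start = np.zeros((4, 4))
target = np.full((4, 4), 0.5)
final_obs, intrinsic_total, extrinsic = run_episode(start, target)
```

In this sketch the intrinsic reward only measures progress toward each imagined goal, while the extrinsic reward is computed once at the end of the episode, mirroring the split between intrinsic motivation and a single task-completion signal mentioned in the abstract.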
Taxonomy
Topics: Visual Attention and Saliency Detection · Multimodal Machine Learning Applications · Advanced Vision and Imaging
