A Deep Compositional Framework for Human-like Language Acquisition in Virtual Environment
Haonan Yu, Haichao Zhang, and Wei Xu

TL;DR
This paper presents a deep, compositional framework enabling a virtual agent to learn language and navigation in a 2D maze, achieving zero-shot command execution through grounded, modular learning.
Contribution
It introduces an end-to-end deep learning approach that learns visual, linguistic, and action representations simultaneously, enabling zero-shot language understanding in a virtual environment.
Findings
The agent can execute zero-shot commands whose word combinations never appeared during training
The agent understands new object concepts learned from another task but never from navigation
The framework's intermediate representations can be visualized to show how the agent comprehends a command
Abstract
We tackle a task where an agent learns to navigate in a 2D maze-like environment called XWORLD. In each session, the agent perceives a sequence of raw-pixel frames, a natural language command issued by a teacher, and a set of rewards. The agent learns the teacher's language from scratch in a grounded and compositional manner, such that after training it is able to correctly execute zero-shot commands: 1) the combination of words in the command never appeared before, and/or 2) the command contains new object concepts that are learned from another task but never learned from navigation. Our deep framework for the agent is trained end to end: it learns simultaneously the visual representations of the environment, the syntax and semantics of the language, and the action module that outputs actions. The zero-shot learning capability of our framework results from its compositionality and…
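The first zero-shot condition in the abstract (a command whose word combination never appeared in training, even though each word did) can be illustrated with a small sketch. This is not the authors' code or the XWORLD command grammar. The vocabulary, the command template, and the function names `split_commands` and `is_zero_shot_combination` are all hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical vocabulary; these words are illustrative, not taken from XWORLD.
ACTIONS = ["go to", "move next to"]
OBJECTS = ["apple", "banana", "cup"]


def split_commands(held_out):
    """Split all (action, object) commands into training and held-out sets.

    `held_out` is a set of (action, object) pairs reserved for zero-shot testing.
    """
    train, test = [], []
    for action, obj in product(ACTIONS, OBJECTS):
        command = f"{action} the {obj}"
        if (action, obj) in held_out:
            test.append(command)
        else:
            train.append(command)
    return train, test


def is_zero_shot_combination(command, train_commands):
    """Return True if every word was seen in training but the full command was not."""
    seen_words = {word for c in train_commands for word in c.split()}
    words_known = all(word in seen_words for word in command.split())
    return words_known and command not in train_commands


train, test = split_commands({("go to", "banana")})
for command in test:
    print(command, is_zero_shot_combination(command, train))
```

Under these assumptions, "go to the banana" counts as a zero-shot combination because "go to" and "banana" each occur in other training commands, while a command containing a word never seen anywhere would fall under the abstract's second condition instead.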
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Speech and dialogue systems
