Modular Framework for Visuomotor Language Grounding
Kolby Nottingham, Litian Liang, Daeyun Shin, Charless C. Fowlkes, Roy Fox, Sameer Singh

TL;DR
This paper introduces a modular LAV framework that separates language, action, and vision components for more efficient training in visuomotor instruction tasks, demonstrated on the ALFRED benchmark.
Contribution
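The paper proposes a modular approach that decouples the language, action, and vision components of an instruction-following agent. Because the action and vision modules no longer depend on instruction-following datasets, they can be trained more efficiently for grounded language and robotics tasks.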
Findings
LAV framework enables independent training of modules.
Preliminary evaluation shows promising results on ALFRED.
Modular design improves data efficiency and flexibility.
Abstract
Natural language instruction following tasks serve as a valuable test-bed for grounded language and robotics research. However, data collection for these tasks is expensive and end-to-end approaches suffer from data inefficiency. We propose the structuring of language, acting, and visual tasks into separate modules that can be trained independently. Using a Language, Action, and Vision (LAV) framework removes the dependence of action and vision modules on instruction following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.
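The separation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the subgoal and percept types, and the rule-based bodies are all hypothetical stand-ins for learned models. The point is only that each module has its own narrow interface, so the vision and action modules can be built and tested without any instruction data.

```python
from typing import Dict, List, Tuple

# Hypothetical interfaces, chosen for illustration only.
Subgoal = Tuple[str, str]   # (verb, object), e.g. ("pickup", "mug")
Percept = Dict[str, bool]   # object label -> visible in the current frame


def language_module(instruction: str) -> List[Subgoal]:
    """Toy stand-in for a learned parser: instruction text -> subgoals."""
    verbs = {"pick": "pickup", "grab": "pickup", "go": "goto", "walk": "goto"}
    subgoals: List[Subgoal] = []
    for clause in instruction.lower().split(" then "):
        words = clause.split()
        verb = next((verbs[w] for w in words if w in verbs), None)
        if verb is not None:
            # Assume the target object is the last word of the clause.
            subgoals.append((verb, words[-1]))
    return subgoals


def vision_module(frame: List[str]) -> Percept:
    """Toy stand-in for a learned detector: here a frame is already a
    list of object labels, so no instruction data is involved."""
    return {label: True for label in frame}


def action_module(subgoal: Subgoal, percept: Percept) -> str:
    """Toy stand-in for a learned policy: (subgoal, percept) -> action.
    It never sees the raw instruction text."""
    verb, obj = subgoal
    if not percept.get(obj, False):
        return "search"
    return "interact" if verb == "pickup" else "move"


def lav_step(instruction: str, frame: List[str]) -> List[str]:
    """Compose the three modules for a single observation."""
    percept = vision_module(frame)
    return [action_module(g, percept) for g in language_module(instruction)]


if __name__ == "__main__":
    print(lav_step("go to the table then pick up the mug", ["mug"]))
```

In this sketch only `language_module` consumes instruction text, so in principle it is the only piece that would need an instruction-following dataset. The other two could be trained on separate vision or control data and swapped in behind the same interfaces.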
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
