Modular Framework for Visuomotor Language Grounding

Kolby Nottingham; Litian Liang; Daeyun Shin; Charless C. Fowlkes; Roy Fox; Sameer Singh

arXiv:2109.02161 · cs.AI · September 7, 2021 · 5 cites

TL;DR

This paper introduces the modular Language, Action, and Vision (LAV) framework, which separates the language, action, and vision components of visuomotor instruction-following tasks so that each can be trained more efficiently. A preliminary evaluation is reported on the ALFRED benchmark.

Contribution

It proposes a modular approach that decouples the language, action, and vision components. Because the action and vision modules no longer depend on instruction-following datasets, they can be trained more efficiently for grounded language and robotics tasks.

Findings

01 The LAV framework enables independent training of the language, action, and vision modules.

02 A preliminary evaluation shows promising results on ALFRED.

03 The modular design improves data efficiency and flexibility.

Abstract

Natural language instruction-following tasks serve as a valuable test-bed for grounded language and robotics research. However, data collection for these tasks is expensive, and end-to-end approaches suffer from data inefficiency. We propose structuring language, acting, and visual tasks into separate modules that can be trained independently. Using a Language, Action, and Vision (LAV) framework removes the dependence of action and vision modules on instruction-following datasets, making them more efficient to train. We also present a preliminary evaluation of LAV on the ALFRED task for visual and interactive instruction following.
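
The separation the abstract describes can be illustrated with a small sketch. This is not the authors' implementation: the interfaces, the `Subgoal` type, and every function name below are hypothetical, and the toy modules stand in for learned models. It only shows how narrow interfaces between modules would let each one be trained or swapped on its own, under the assumption that the language module emits subgoals which the action module carries out using the vision module's detections.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names here are hypothetical illustrations, not the paper's API.


@dataclass
class Subgoal:
    action: str  # e.g. "pickup"
    target: str  # e.g. "mug"


# Each module is a callable with a narrow interface, so it can be
# trained or replaced without touching the other two.
LanguageModule = Callable[[str], List[Subgoal]]     # instruction -> subgoals
VisionModule = Callable[[List[str]], List[str]]     # observation -> visible objects
ActionModule = Callable[[Subgoal, List[str]], str]  # subgoal + objects -> action


def toy_language(instruction: str) -> List[Subgoal]:
    # Toy stand-in for a learned parser: one subgoal per "then" clause,
    # using the first word as the action and the last word as the target.
    subgoals = []
    for clause in instruction.lower().split(" then "):
        words = clause.split()
        subgoals.append(Subgoal(action=words[0], target=words[-1]))
    return subgoals


def toy_vision(observation: List[str]) -> List[str]:
    # Toy stand-in for a detector: the observation is already a list of
    # object labels, so just deduplicate and sort them.
    return sorted(set(observation))


def toy_action(subgoal: Subgoal, visible: List[str]) -> str:
    # Toy stand-in for a policy: act if the target is visible, else explore.
    if subgoal.target in visible:
        return f"{subgoal.action}({subgoal.target})"
    return "explore"


def run_episode(
    instruction: str,
    observations: List[List[str]],
    language: LanguageModule,
    vision: VisionModule,
    action: ActionModule,
) -> List[str]:
    # Only the language module ever sees the instruction text.
    plan = language(instruction)
    return [action(sg, vision(obs)) for sg, obs in zip(plan, observations)]


if __name__ == "__main__":
    steps = run_episode(
        "pickup the mug then open the fridge",
        [["mug", "table"], ["sink"]],
        toy_language,
        toy_vision,
        toy_action,
    )
    print(steps)
```

In this sketch `toy_vision` and `toy_action` never receive the instruction string, which mirrors the abstract's point that the action and vision modules need not depend on instruction-following data.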

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling