Breaking Down the Task: A Unit-Grained Hybrid Training Framework for Vision and Language Decision Making
Ruipu Luo, Jiwen Zhang, Zhongyu Wei

TL;DR
This paper introduces a unit-grained hybrid training framework for vision-language decision making, improving learning efficiency and performance by breaking tasks into smaller units and leveraging a novel transformer model with cross-modal memory.
Contribution
The paper proposes a model-agnostic hybrid-training framework that divides VLDM tasks into units and reduces exposure bias, together with a Unit-Transformer (UT) that maintains a unit-scale cross-modal memory.
Findings
Outperforms state-of-the-art methods on the TEACh benchmark
Effective in reducing exposure bias in training
Demonstrates improved decision-making accuracy
Abstract
Vision language decision making (VLDM) is a challenging multimodal task. The agent has to understand complex human instructions and complete compositional tasks involving environment navigation and object manipulation. However, the long action sequences involved in VLDM make the task difficult to learn. From an environment perspective, we find that task episodes can be divided into fine-grained units, each containing a navigation phase and an interaction phase. Since the environment within a unit stays unchanged, we propose a novel hybrid-training framework that enables active exploration in the environment and reduces the exposure bias. Such a framework leverages the unit-grained configurations and is model-agnostic. Specifically, we design a Unit-Transformer (UT) with an intrinsic recurrent state that maintains a unit-scale cross-modal memory. Through extensive experiments on…
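To make the notion of a unit concrete, here is a minimal Python sketch of one way an episode's action sequence could be split into units, each a navigation phase followed by an interaction phase. It is an illustration under stated assumptions, not the authors' implementation: the action names, the `NAVIGATION_ACTIONS` set, and the `split_into_units` helper are all hypothetical, and the paper's actual segmentation rule may differ.

```python
from typing import List

# Hypothetical navigation vocabulary (an assumption for illustration);
# the benchmark's real action names may differ.
NAVIGATION_ACTIONS = {"Forward", "Backward", "Turn Left", "Turn Right"}


def split_into_units(actions: List[str]) -> List[List[str]]:
    """Group actions into units: a run of navigation actions
    followed by a run of interaction actions."""
    units: List[List[str]] = []
    current: List[str] = []
    seen_interaction = False
    for action in actions:
        is_nav = action in NAVIGATION_ACTIONS
        # A navigation action that follows an interaction opens a new unit.
        if is_nav and seen_interaction:
            units.append(current)
            current = []
            seen_interaction = False
        current.append(action)
        if not is_nav:
            seen_interaction = True
    # Keep the trailing unit, even if it has no interaction phase.
    if current:
        units.append(current)
    return units


# Made-up episode used only to exercise the helper.
episode = ["Forward", "Turn Left", "Pickup", "Forward", "Forward", "Place", "ToggleOn"]
units = split_into_units(episode)
```

With this made-up episode, the helper yields two units: the first ends at `Pickup`, and the second covers the remaining navigation steps plus `Place` and `ToggleOn`.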
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
