Are you doing what I say? On modalities alignment in ALFRED
Ting-Rui Chiang, Yi-Ting Yeh, Ta-Chung Chi, Yau-Shian Wang

TL;DR
This paper investigates the importance of aligning natural language instructions with visual inputs in the ALFRED benchmark, introduces a metric to measure alignment, and proposes methods to improve it, leading to better task performance.
Contribution
The paper introduces the boundary adherence score (BAS) to measure modality alignment and proposes approaches to enhance alignment, improving overall task success in ALFRED.
Findings
Existing models poorly align text and visual modalities.
Improved alignment correlates with higher task success.
Proposed methods effectively enhance modality alignment.
Abstract
ALFRED is a recently proposed benchmark that requires a model to complete tasks, specified by natural language instructions, in simulated house environments. We hypothesize that a key to success is accurately aligning the text modality with visual inputs. Motivated by this, we inspect how well existing models can align these modalities using our proposed intrinsic metric, boundary adherence score (BAS). The results show that previous models indeed fail to perform proper alignment. To address this issue, we introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end task performance.
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Advanced Neural Network Applications · Multimodal Machine Learning Applications
