Are you doing what I say? On modalities alignment in ALFRED

Ting-Rui Chiang; Yi-Ting Yeh; Ta-Chung Chi; Yau-Shian Wang

arXiv:2110.05665·cs.CL·October 13, 2021·1 cites

TL;DR

This paper investigates the importance of aligning natural language instructions with visual inputs in the ALFRED benchmark, introduces a metric to measure alignment, and proposes methods to improve it, leading to better task performance.

Contribution

The paper introduces the boundary adherence score (BAS) to measure modality alignment and proposes approaches to enhance alignment, improving overall task success in ALFRED.

Findings

01

Existing models poorly align text and visual modalities.

02

Improved alignment correlates with higher task success.

03

Proposed methods effectively enhance modality alignment.

Abstract

ALFRED is a recently proposed benchmark that requires a model to complete tasks in simulated house environments specified by instructions in natural language. We hypothesize that the key to success is accurately aligning the text modality with visual inputs. Motivated by this, we inspect how well existing models can align these modalities using our proposed intrinsic metric, the boundary adherence score (BAS). The results show that previous models indeed fail to perform proper alignment. To address this issue, we introduce approaches aimed at improving model alignment and demonstrate how improved alignment improves end-task performance.

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Advanced Neural Network Applications · Multimodal Machine Learning Applications