Curriculum Learning for Vision-and-Language Navigation

Jiwen Zhang; Zhongyu Wei; Jianqing Fan; Jiajie Peng

arXiv:2111.07228 · cs.LG · November 16, 2021 · 1 citation

TL;DR

This paper introduces a curriculum learning approach for Vision-and-Language Navigation that improves agent performance and training efficiency by systematically ordering training samples based on difficulty.

Contribution

We propose a novel curriculum-based training paradigm for VLN that re-arranges datasets to better match human prior knowledge and learning progress.

Findings

01. Significant performance improvements on the R2R benchmark

02. Enhanced generalizability of navigation agents

03. Increased training efficiency without added model complexity

Abstract

Vision-and-Language Navigation (VLN) is a task in which an agent navigates an embodied indoor environment by following human instructions. Previous work ignores the distribution of sample difficulty, and we argue that this potentially degrades agent performance. To tackle this issue, we propose a novel curriculum-based training paradigm for VLN tasks that balances human prior knowledge about training samples with the agent's learning progress on them. We develop the principle of curriculum design and re-arrange the benchmark Room-to-Room (R2R) dataset to make it suitable for curriculum training. Experiments show that our method is model-agnostic and can significantly improve the performance, generalizability, and training efficiency of current state-of-the-art navigation agents without increasing model complexity.
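To make the idea of ordering training samples by difficulty concrete, here is a minimal sketch of a generic easy-to-hard curriculum with cumulative stages. It is not the paper's algorithm: the `difficulty` field, the `curriculum_stages` helper, and the equal-sized stage schedule are all assumptions for illustration, and the sketch covers only a fixed prior ordering, not the agent's learning progress.

```python
from typing import Dict, List, Sequence


def curriculum_stages(samples: Sequence[Dict], num_stages: int) -> List[List[Dict]]:
    """Return cumulative training pools ordered from easy to hard.

    Each sample is assumed to carry a numeric 'difficulty' score
    (hypothetical field; lower means easier).
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    # Sort once by the assumed difficulty score.
    ordered = sorted(samples, key=lambda s: s["difficulty"])
    stages = []
    for k in range(1, num_stages + 1):
        # Stage k exposes the easiest k/num_stages fraction of the data.
        cutoff = max(1, (len(ordered) * k) // num_stages)
        stages.append(ordered[:cutoff])
    return stages


# Toy data standing in for navigation episodes.
samples = [
    {"id": "a", "difficulty": 3},
    {"id": "b", "difficulty": 1},
    {"id": "c", "difficulty": 2},
    {"id": "d", "difficulty": 5},
]
stages = curriculum_stages(samples, num_stages=2)
for i, pool in enumerate(stages, start=1):
    print(f"stage {i}:", [s["id"] for s in pool])
```

A training loop would then iterate over `stages`, training for some number of epochs on each pool before moving to the next. A scheme that also tracks learning progress, as the abstract describes, would additionally reweight or reselect samples based on the agent's current loss.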

Videos

Curriculum Learning for Vision-and-Language Navigation · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition