A modular vision language navigation and manipulation framework for long horizon compositional tasks in indoor environment

Homagni Saha; Fateme Fotouhif; Qisai Liu; Soumik Sarkar

arXiv:2101.07891 · cs.CV · January 21, 2021

TL;DR

This paper introduces MoViLan, a modular framework for executing visually grounded natural language instructions in indoor household settings. It handles long-horizon, compositional navigation and manipulation tasks without requiring strictly aligned vision and language training data, and it improves success rates on the ALFRED benchmark.

Contribution

The paper presents a modular approach that combines geometry-aware mapping with generalized language understanding to improve performance on complex indoor navigation and manipulation tasks.

Findings

01 Significant increase in success rates on the ALFRED benchmark.

02 Effective handling of long-horizon, compositional tasks.

03 Modular approach reduces dependency on aligned training data.

Abstract

In this paper we propose a new framework, MoViLan (Modular Vision and Language), for the execution of visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets has revealed a gap in developing comprehensive techniques for long-horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions, and visual scenarios with non-reversible state changes. We propose a modular approach to deal with the combined navigation and object interaction problem without the need for strictly aligned vision and language training data (e.g., in the form of expert-demonstrated trajectories). Such an approach is a significant departure from…

Code & Models

Repositories

Homagn/MOVILAN
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques