A modular vision language navigation and manipulation framework for long horizon compositional tasks in indoor environment
Homagni Saha, Fateme Fotouhif, Qisai Liu, Soumik Sarkar

TL;DR
This paper introduces MoViLan, a modular framework for indoor vision-language tasks that handles long-horizon, compositional navigation and manipulation without requiring strictly aligned vision and language training data, and improves success rates on the ALFRED benchmark.
Contribution
The paper presents a novel modular approach with geometry-aware mapping and generalized language understanding, enabling better performance on complex indoor tasks.
Findings
Significant increase in success rates on the ALFRED benchmark.
Effective handling of long-horizon, compositional tasks.
Modular approach reduces dependency on aligned training data.
Abstract
In this paper we propose a new framework, MoViLan (Modular Vision and Language), for executing visually grounded natural language instructions for day-to-day indoor household tasks. While several data-driven, end-to-end learning frameworks have been proposed for targeted navigation tasks based on the vision and language modalities, performance on recent benchmark data sets has revealed a gap in developing comprehensive techniques for long-horizon, compositional tasks (involving manipulation and navigation) with diverse object categories, realistic instructions, and visual scenarios with non-reversible state changes. We propose a modular approach to the combined navigation and object interaction problem that does not need strictly aligned vision and language training data (e.g., in the form of expert-demonstrated trajectories). Such an approach is a significant departure from…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
