Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features

Makram Chahine; Alex Quach; Alaa Maalouf; Tsun-Hsuan Wang; Daniela Rus

arXiv:2410.13002 · cs.RO · May 19, 2025

TL;DR

Flex leverages pre-trained vision-language models as fixed feature extractors to enable robust, end-to-end visual navigation that generalizes well to unseen environments and instructions with minimal training data.

Contribution

The paper introduces Flex, a framework that uses frozen VLM features for improved generalization in vision-based navigation tasks with minimal data.

Findings

01. Successful transfer from simulation to real-world scenes.

02. Effective generalization to new goals and commands.

03. Minimal data needed for robust performance.

Abstract

End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on…
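The two ingredients named in the abstract, a frozen patch-wise feature extractor and a policy trained by behavior cloning, can be illustrated with a small sketch. This is not the authors' implementation. The frozen encoder is replaced by a fixed random projection (`W_FROZEN`), the text instruction is left out entirely, the policy head is linear, and a closed-form ridge regression stands in for gradient-based training. Every name, dimension, and the synthetic "bright target" dataset below are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

PATCH = 8       # patch side length in pixels (assumed)
FEAT_DIM = 16   # per-patch feature size (assumed)

# Stand-in for a frozen pre-trained encoder: a fixed random projection.
# A real system would query a VLM here; this matrix is never updated.
W_FROZEN = rng.normal(size=(PATCH * PATCH, FEAT_DIM)) / PATCH


def patchify(image):
    """Split an (H, W) image into flattened, non-overlapping patches."""
    h, w = image.shape
    gh, gw = h // PATCH, w // PATCH
    grid = image[:gh * PATCH, :gw * PATCH].reshape(gh, PATCH, gw, PATCH)
    return grid.transpose(0, 2, 1, 3).reshape(gh * gw, PATCH * PATCH)


def frozen_patch_features(image):
    """Return one feature vector per patch: shape (num_patches, FEAT_DIM)."""
    return np.tanh(patchify(image) @ W_FROZEN)


def policy_input(image):
    """Flatten the patch grid so the head can use where a feature occurs."""
    return frozen_patch_features(image).reshape(-1)


def make_demo(n, size=32):
    """Hypothetical expert data: steer toward a bright square in the image."""
    images, actions = [], []
    for _ in range(n):
        r, c = rng.integers(0, size - PATCH, size=2)
        img = np.zeros((size, size))
        img[r:r + PATCH, c:c + PATCH] = 1.0
        images.append(img)
        # Expert command: normalized offset of the target from image center.
        actions.append([(c + PATCH / 2) / size - 0.5,
                        (r + PATCH / 2) / size - 0.5])
    return images, np.array(actions)


def fit_head(images, expert_actions, ridge=1e-3):
    """Behavior cloning as ridge regression from frozen features to actions."""
    X = np.stack([policy_input(im) for im in images])
    X = np.hstack([X, np.ones((len(X), 1))])  # bias column
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ expert_actions)


def act(head, image):
    """Map a single image to a command using the trained head."""
    return np.append(policy_input(image), 1.0) @ head
```

Only `fit_head` changes any parameters; the extractor stays fixed throughout, which is the property the frozen-feature design relies on. A call such as `head = fit_head(*make_demo(64))` followed by `act(head, image)` exercises the whole path from image to command.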

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization

Methods: Attentive Walk-Aggregating Graph Neural Network