Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features
Makram Chahine, Alex Quach, Alaa Maalouf, Tsun-Hsuan Wang, Daniela Rus

TL;DR
Flex leverages pre-trained vision-language models as fixed feature extractors to enable robust, end-to-end visual navigation that generalizes well to unseen environments and instructions with minimal training data.
Contribution
The paper introduces Flex, a framework that uses frozen VLM features for improved generalization in vision-based navigation tasks with minimal data.
Findings
Successful transfer from simulation to real-world scenes.
Effective generalization to new goals and commands.
Minimal data needed for robust performance.
Abstract
End-to-end learning directly maps sensory inputs to actions, creating highly integrated and efficient policies for complex robotics tasks. However, such models often struggle to generalize beyond their training scenarios, limiting adaptability to new environments, tasks, and concepts. In this work, we investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies under unseen text instructions and visual distribution shifts. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors, generating spatially aware embeddings that integrate semantic and visual information. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning on…
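The frozen-extractor idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. A fixed random projection stands in for the pre-trained VLM, text conditioning is omitted, the "expert" actions are synthetic, and every name and size (`PATCH`, `FEAT_DIM`, `ACT_DIM`, `patch_features`, `W_head`) is hypothetical. It shows only the division of labor: patch-wise features come from weights that are never updated, and a small policy head is fit to demonstrations by behavior cloning.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, FEAT_DIM, ACT_DIM = 4, 16, 4  # hypothetical sizes, not from the paper

# Stand-in for a frozen pre-trained encoder: a fixed random projection.
# (Assumption: a real system would call a VLM's patch-level encoder here.)
W_frozen = rng.standard_normal((PATCH * PATCH * 3, FEAT_DIM)) / np.sqrt(PATCH * PATCH * 3)
frozen_snapshot = W_frozen.copy()  # kept only to check the encoder never changes


def patch_features(image):
    """Embed each non-overlapping PATCH x PATCH tile of an (H, W, 3) image."""
    h, w, c = image.shape
    gh, gw = h // PATCH, w // PATCH
    tiles = image[: gh * PATCH, : gw * PATCH].reshape(gh, PATCH, gw, PATCH, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4).reshape(gh * gw, -1)
    return np.tanh(tiles @ W_frozen)  # shape: (num_patches, FEAT_DIM)


# Synthetic demonstrations: random 8x8 images and actions from a hidden linear expert.
images = rng.random((64, 8, 8, 3))
X = np.stack([patch_features(im).reshape(-1) for im in images])  # patch order preserved
W_expert = rng.standard_normal((X.shape[1], ACT_DIM))
Y = X @ W_expert


def bc_loss(W):
    """Mean squared error between predicted and demonstrated actions."""
    return float(np.mean((X @ W - Y) ** 2))


# Behavior cloning: gradient descent on the policy head only.
W_head = np.zeros((X.shape[1], ACT_DIM))
initial_loss = bc_loss(W_head)
for _ in range(500):
    grad = 2.0 * X.T @ (X @ W_head - Y) / Y.size
    W_head -= 0.01 * grad
final_loss = bc_loss(W_head)
```

Flattening the per-patch embeddings in grid order, rather than averaging them, is one simple way to keep the spatial layout available to the policy head; other aggregation schemes are equally possible.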
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Video Analysis and Summarization
Methods: Attentive Walk-Aggregating Graph Neural Network
