Harnessing Input-Adaptive Inference for Efficient VLN
Dongwoo Kang, Akhil Perincherry, Zachary Coalson, Aiden Gabriel, Stefan Lee, Sanghyun Hong

TL;DR
This paper introduces input-adaptive algorithms for vision-and-language navigation that significantly reduce computational costs while maintaining performance, making VLN models more practical for resource-limited settings.
Contribution
It proposes three novel adaptive algorithms, each operating at a different level (spatial, intra-model, and temporal), that improve VLN efficiency while keeping navigation performance comparable.
Findings
Over 2× reduction in computation across multiple benchmarks
Effective in both standard and continuous environments
Maintains comparable navigation performance
Abstract
An emerging paradigm in vision-and-language navigation (VLN) is the use of history-aware multi-modal transformer models. Given a language instruction, these models process observation and navigation history to predict the most appropriate action for an agent. While they have significantly improved performance, the scale of these models can be a bottleneck in practical settings with limited computational resources. In this work, we propose a novel input-adaptive navigation method to enhance VLN model efficiency. We first show that existing input-adaptive mechanisms fail to reduce computations without substantial performance degradation. To address this, we introduce three adaptive algorithms, each deployed at a different level: (1) To improve spatial efficiency, we selectively process panoramic views at each observation of an agent. (2) To improve intra-model efficiency, we propose…
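To make point (1) more concrete, the following is a minimal Python sketch of one way an agent might process only a subset of its panoramic views. It is an illustration under stated assumptions, not the paper's algorithm: the function name `select_views`, the ring layout of views, the rule of keeping views within `k` positions of a navigable direction, and all numbers are hypothetical.

```python
from typing import List, Sequence


def select_views(num_views: int, navigable: Sequence[int], k: int) -> List[int]:
    """Return sorted indices of views to encode.

    Assumption (hypothetical, not from the paper): views are arranged in a
    ring, and a view is kept if it lies within k positions of any view that
    points toward a navigable direction. All other views are skipped.
    """
    keep = set()
    for idx in navigable:
        for offset in range(-k, k + 1):
            # Wrap around the ring so neighbors of view 0 include the last view.
            keep.add((idx + offset) % num_views)
    return sorted(keep)


# Hypothetical panorama: 12 views, navigable directions at views 0 and 6,
# keeping one neighbor on each side of every navigable view.
kept = select_views(num_views=12, navigable=[0, 6], k=1)
fraction_processed = len(kept) / 12
print(kept, fraction_processed)
```

In this toy setting, only the kept views would be passed to the visual encoder, so the encoder cost scales with `len(kept)` rather than with the full panorama. How views are actually chosen in the paper is not stated in the truncated abstract above.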
