Seeing is Believing? Enhancing Vision-Language Navigation using Visual Perturbations
Xuesong Zhang, Jia Li, Yunbo Xu, Zhenzhen Hu, Richang Hong

TL;DR
This paper investigates the impact of visual perturbations on vision-language navigation models and introduces a multi-branch architecture that improves navigation performance by processing diverse visual inputs.
Contribution
It reveals that simple visual perturbations can enhance VLN performance and proposes a versatile multi-branch architecture to leverage multiple visual inputs effectively.
Findings
Simple visual perturbations can improve navigation accuracy.
Multi-branch architecture matches or surpasses state-of-the-art results.
The approach is agnostic to VLN agent topology.
Abstract
Autonomous navigation guided by natural language instructions in embodied environments remains a challenge for vision-language navigation (VLN) agents. Although recent advancements in learning diverse and fine-grained visual environmental representations have shown promise, the fragile performance improvements may not be conclusively attributable to enhanced visual grounding, a limitation also observed in related vision-language tasks. In this work, we preliminarily investigate whether advanced VLN models genuinely comprehend the visual content of their environments by introducing varying levels of visual perturbations. These perturbations include ground-truth depth images, perturbed views and random noise. Surprisingly, we experimentally find that simple branch expansion, even with noisy visual inputs, paradoxically improves the navigational efficacy. Inspired by these insights, we further…
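The abstract describes expanding a VLN agent with extra branches that receive perturbed or noisy visual inputs. The toy sketch below shows one way such a setup could be wired. It is not the authors' implementation: the names `noise_view`, `masked_view`, `branch_scores`, and `fuse` are hypothetical, the linear scoring branch stands in for a full VLN model, and averaging branch scores is an assumed fusion rule that this excerpt does not confirm. Depth-image inputs are omitted.

```python
import math
import random


def noise_view(features, rng, scale=1.0):
    """Replace every visual feature with Gaussian noise of the same shape."""
    return [[rng.gauss(0.0, scale) for _ in row] for row in features]


def masked_view(features, rng, drop_prob=0.3):
    """Zero out a random subset of feature values (a stand-in perturbation)."""
    return [[0.0 if rng.random() < drop_prob else v for v in row]
            for row in features]


def branch_scores(features, weights):
    """One hypothetical branch: a linear score for each candidate view."""
    return [sum(v * w for v, w in zip(row, weights)) for row in features]


def fuse(per_branch_scores):
    """Average the branch scores (assumed rule), then softmax over candidates."""
    n = len(per_branch_scores)
    mean = [sum(col) / n for col in zip(*per_branch_scores)]
    top = max(mean)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in mean]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
num_candidates, feat_dim = 4, 8

# Stand-in for the RGB features of each navigable candidate view.
rgb = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
       for _ in range(num_candidates)]

# Three branches: clean features, a perturbed view, and pure noise.
inputs = [rgb, masked_view(rgb, rng), noise_view(rgb, rng)]
weights = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)] for _ in inputs]

action_probs = fuse([branch_scores(x, w) for x, w in zip(inputs, weights)])
```

Here `action_probs` is a probability distribution over the four candidate views. In a real agent each branch would be a trained network rather than a random linear map, and the fusion step would follow whatever the full paper specifies.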
Taxonomy
Topics: Gaze Tracking and Assistive Technology · Speech and dialogue systems · Advanced Image and Video Retrieval Techniques
Methods: Balanced Selection
