BEACON: Language-Conditioned Navigation Affordance Prediction under Occlusion
Xinyu Gao, Gang Chen, Javier Alonso-Mora

TL;DR
BEACON enables robots to predict navigable target locations in occluded areas by using a bird's-eye view heatmap conditioned on language instructions, improving spatial reasoning beyond visible regions.
Contribution
This work introduces a novel BEV affordance prediction method that incorporates spatial cues into vision-language models to handle occlusions in language-conditioned navigation.
Findings
A 22.74 percentage-point improvement over the state of the art
An effective BEV formulation for reasoning about occluded regions
Validated on an occlusion-rich dataset in the Habitat simulator
Abstract
Language-conditioned local navigation requires a robot to infer a nearby traversable target location from its current observation and an open-vocabulary, relational instruction. Existing vision-language spatial grounding methods usually rely on vision-language models (VLMs) to reason in image space, producing 2D predictions tied to visible pixels. As a result, they struggle to infer target locations in occluded regions, typically caused by furniture or moving humans. To address this issue, we propose BEACON, which predicts an ego-centric Bird's-Eye View (BEV) affordance heatmap over a bounded local region including occluded areas. Given an instruction and surround-view RGB-D observations from four directions around the robot, BEACON predicts the BEV heatmap by injecting spatial cues into a VLM and fusing the VLM's output with depth-derived BEV features. Using an occlusion-aware dataset…
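To make the BEV formulation concrete, the following is a minimal sketch of two generic building blocks: rasterizing depth-derived points into an ego-centric BEV grid, and reading a target location back out of a heatmap. It is not BEACON's implementation. The function names, the 64-cell grid, the 3 m half-extent, and the row = y, column = x convention are all assumptions chosen for illustration, and the learned VLM fusion step is omitted.

```python
import numpy as np


def points_to_bev(points_xy, half_extent_m=3.0, grid_size=64):
    """Rasterize ego-centric (x, y) points in metres into a binary BEV grid.

    Hypothetical helper: grid size and extent are illustrative assumptions.
    """
    cell = 2.0 * half_extent_m / grid_size
    # Shift so the robot sits at the grid centre, then convert to cell indices.
    idx = np.floor((points_xy + half_extent_m) / cell).astype(int)
    # Drop points that fall outside the bounded local region.
    inside = np.all((idx >= 0) & (idx < grid_size), axis=1)
    bev = np.zeros((grid_size, grid_size), dtype=np.float32)
    bev[idx[inside, 1], idx[inside, 0]] = 1.0  # row = y, column = x
    return bev


def heatmap_to_target(heatmap, half_extent_m=3.0):
    """Return the metric (x, y) centre of the highest-scoring BEV cell."""
    grid_size = heatmap.shape[0]
    cell = 2.0 * half_extent_m / grid_size
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    x = (col + 0.5) * cell - half_extent_m
    y = (row + 0.5) * cell - half_extent_m
    return float(x), float(y)


# Usage: one point inside the region, one far outside it.
points = np.array([[0.1, 0.1], [10.0, 10.0]])
bev = points_to_bev(points)
target_xy = heatmap_to_target(bev)
```

Because the grid covers a fixed metric region around the robot rather than the camera image, a cell can receive a high score even when no visible pixel maps to it, which is what lets a BEV heatmap place targets behind occluders.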
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Advanced Neural Network Applications
