CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation
Haihong Hao, Mingfei Han, Changlin Li, Zhihui Li, Xiaojun Chang

TL;DR
CoNav introduces a collaborative cross-modal reasoning framework in which a 3D-text model guides an image-text navigation agent, combining 3D-text guidance with visual cues and yielding significant improvements on four navigation benchmarks.
Contribution
This work presents a novel framework that explicitly guides navigation agents using 3D-text models, addressing challenges in multi-modal fusion and ambiguity resolution in embodied navigation.
Findings
Significant improvements on four navigation benchmarks.
Effective integration of 3D-text guidance with visual cues.
Shorter paths achieved compared to other methods.
Abstract
Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face two challenges: the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework where a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text…
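The abstract is cut off before it details Cross-Modal Belief Alignment, so the following is only a minimal sketch of the general idea it names: textual hypotheses produced by a 3D-text model are shared with an image-text navigation agent as extra text. Every name here (`Hypothesis`, `select_hypotheses`, `build_agent_prompt`), the confidence score, and the threshold and top-k filtering are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A textual belief emitted by a (hypothetical) 3D-text model."""
    text: str
    confidence: float  # assumed score in [0, 1]; not specified in the source


def select_hypotheses(hyps: List[Hypothesis],
                      min_conf: float = 0.5,
                      top_k: int = 2) -> List[Hypothesis]:
    """Keep the top_k most confident hypotheses at or above min_conf.

    The filtering rule is an illustrative assumption, not the paper's method.
    """
    kept = [h for h in hyps if h.confidence >= min_conf]
    return sorted(kept, key=lambda h: h.confidence, reverse=True)[:top_k]


def build_agent_prompt(instruction: str, hyps: List[Hypothesis]) -> str:
    """Append the shared textual hypotheses to the navigation instruction."""
    lines = [f"Instruction: {instruction}"]
    if hyps:
        lines.append("Spatial hints from 3D-text model:")
        lines.extend(f"- {h.text}" for h in hyps)
    return "\n".join(lines)


if __name__ == "__main__":
    candidates = [
        Hypothesis("The sofa is left of the doorway.", 0.9),
        Hypothesis("There is a staircase behind the wall.", 0.2),
        Hypothesis("The kitchen is beyond the hallway.", 0.7),
    ]
    prompt = build_agent_prompt("Walk to the kitchen.",
                                select_hypotheses(candidates))
    print(prompt)
```

Because the guidance is exchanged as plain text, a sketch like this needs no joint 2D/3D/text training data, which matches the motivation the abstract gives for avoiding unified fusion.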
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies
