CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation

Haihong Hao; Mingfei Han; Changlin Li; Zhihui Li; Xiaojun Chang

arXiv:2505.16663 · cs.CV · May 23, 2025

TL;DR

CoNav is a collaborative cross-modal reasoning framework in which a pretrained 3D-text model guides an image-text navigation agent with structured spatial-semantic knowledge, yielding significant improvements on four navigation benchmarks.

Contribution

This work presents a framework in which a pretrained 3D-text model explicitly guides an image-text navigation agent, addressing two problems of unified multi-modal fusion: the limited availability of triple-modality data and the difficulty of resolving conflicting beliefs among modalities.

Findings

01. Significant improvements on four navigation benchmarks.
02. Effective integration of 3D-text guidance with visual cues.
03. Shorter paths achieved compared to other methods.

Abstract

Embodied navigation demands comprehensive scene understanding and precise spatial reasoning. While image-text models excel at interpreting pixel-level color and lighting cues, 3D-text models capture volumetric structure and spatial relationships. However, unified fusion approaches that jointly fuse 2D images, 3D point clouds, and textual instructions face challenges in limited availability of triple-modality data and difficulty resolving conflicting beliefs among modalities. In this work, we introduce CoNav, a collaborative cross-modal reasoning framework where a pretrained 3D-text model explicitly guides an image-text navigation agent by providing structured spatial-semantic knowledge to resolve ambiguities during navigation. Specifically, we introduce Cross-Modal Belief Alignment, which operationalizes this cross-modal guidance by simply sharing textual hypotheses from the 3D-text…
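The abstract is truncated before it explains how the shared textual hypotheses are consumed, so the following is only a minimal sketch of the general idea of passing text from one model into another model's prompt. It is not the authors' implementation. The names `Hypothesis` and `build_agent_prompt`, the `confidence` score, and the `min_confidence` and `top_k` parameters are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hypothesis:
    """A textual belief from a 3D-text model (hypothetical structure)."""
    text: str
    confidence: float  # assumed score in [0, 1]; not described in the abstract


def build_agent_prompt(
    instruction: str,
    hypotheses: List[Hypothesis],
    min_confidence: float = 0.5,  # assumed filtering threshold
    top_k: int = 3,               # assumed cap on the number of hints
) -> str:
    """Append the most confident spatial hypotheses to the instruction text."""
    # Keep only hypotheses above the threshold, most confident first.
    kept = sorted(
        (h for h in hypotheses if h.confidence >= min_confidence),
        key=lambda h: h.confidence,
        reverse=True,
    )[:top_k]
    if not kept:
        return f"Instruction: {instruction}"
    hints = "\n".join(f"- {h.text}" for h in kept)
    return (
        f"Instruction: {instruction}\n"
        f"Spatial hints from 3D-text model:\n{hints}"
    )


if __name__ == "__main__":
    hyps = [
        Hypothesis("The sofa is to the left of the door.", 0.9),
        Hypothesis("A table is behind the agent.", 0.2),
    ]
    print(build_agent_prompt("Go to the sofa.", hyps))
```

Because the only thing exchanged is text, neither model has to be retrained on joint 2D, 3D, and text data in this sketch, which matches the motivation the abstract gives for avoiding unified fusion.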

Code & Models

Repositories

oceanhao/CoNav (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems · Semantic Web and Ontologies