TACOcc: Target-Adaptive Cross-Modal Fusion with Volume Rendering for 3D Semantic Occupancy

Luyao Lei; Shuo Xu; Yifan Bai; Xing Wei

arXiv:2505.12693 · cs.CV · May 20, 2025

TL;DR

TACOcc introduces an adaptive cross-modal fusion method with volume rendering supervision to improve 3D semantic occupancy prediction, addressing geometry-semantics mismatch and surface detail loss in multi-modal data.

Contribution

The paper proposes a novel target-scale adaptive bidirectional retrieval mechanism and an improved volume rendering pipeline for enhanced multi-modal 3D occupancy prediction.

Findings

01. Outperforms existing methods on nuScenes and SemanticKITTI benchmarks.

02. Effectively aligns features across modalities with adaptive neighborhood expansion and shrinking.

03. Enhances surface detail reconstruction via volume rendering supervision.

Abstract

The performance of multi-modal 3D occupancy prediction is limited by ineffective fusion, mainly due to geometry-semantics mismatch from fixed fusion strategies and surface detail loss caused by sparse, noisy annotations. The mismatch stems from the heterogeneous scale and distribution of point cloud and image features, leading to biased matching under fixed neighborhood fusion. To address this, we propose a target-scale adaptive, bidirectional symmetric retrieval mechanism. It expands the neighborhood for large targets to enhance context awareness and shrinks it for small ones to improve efficiency and suppress noise, enabling accurate cross-modal feature alignment. This mechanism explicitly establishes spatial correspondences and improves fusion accuracy. For surface detail loss, sparse labels provide limited supervision, resulting in poor predictions for small objects. We introduce an…
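
The neighborhood expansion and shrinking described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names (`adaptive_k`, `k_nearest`, `bidirectional_retrieve`), the per-query scale score in [0, 1], the linear mapping from scale to neighborhood size, and the use of plain 2D nearest-neighbor search are all assumptions made for illustration. It only shows the general idea that larger targets retrieve more cross-modal neighbors, smaller targets retrieve fewer, and retrieval runs in both directions (point to pixel and pixel to point).

```python
import math


def adaptive_k(scale, k_min=1, k_max=8):
    """Map a target-scale score in [0, 1] to a neighborhood size.

    Hypothetical rule: linear interpolation between k_min and k_max.
    """
    s = min(max(scale, 0.0), 1.0)  # clamp the score to [0, 1]
    return k_min + round(s * (k_max - k_min))


def k_nearest(query, candidates, k):
    """Return indices of the (at most) k candidates closest to query."""
    order = sorted(
        range(len(candidates)),
        key=lambda i: math.dist(query, candidates[i]),
    )
    return order[:k]


def bidirectional_retrieve(points_2d, pixels_2d, point_scales, pixel_scales):
    """Retrieve neighbors in both directions with per-query neighborhood size.

    points_2d: projected point-cloud locations in the image plane (assumed).
    pixels_2d: image feature locations (assumed).
    Returns (point_to_pixel, pixel_to_point) lists of index lists.
    """
    point_to_pixel = [
        k_nearest(p, pixels_2d, adaptive_k(s))
        for p, s in zip(points_2d, point_scales)
    ]
    pixel_to_point = [
        k_nearest(q, points_2d, adaptive_k(s))
        for q, s in zip(pixels_2d, pixel_scales)
    ]
    return point_to_pixel, pixel_to_point


if __name__ == "__main__":
    pixels = [(0, 0), (1, 0), (2, 0), (3, 0)]
    points = [(0.1, 0), (2.9, 0)]
    # First point is treated as a small target, second as a large one.
    p2i, i2p = bidirectional_retrieve(points, pixels, [0.0, 1.0], [0.0] * 4)
    print(p2i)
    print(i2p)
```

In this toy setup the small-scale query keeps only its single nearest pixel, while the large-scale query gathers every available pixel as context. A real system would presumably derive the scale signal from the data and fuse the retrieved features rather than return indices.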

Taxonomy

Topics: Advanced Vision and Imaging · 3D Shape Modeling and Analysis · Computer Graphics and Visualization Techniques