MV2DFusion: Leveraging Modality-Specific Object Semantics for Multi-Modal 3D Detection

Zitian Wang; Zehao Huang; Yulu Gao; Naiyan Wang; Si Liu

arXiv:2408.05945·cs.CV·July 4, 2025·5 cites

TL;DR

MV2DFusion is a multi-modal 3D detection framework that fuses camera and LiDAR data through modality-specific object semantics and query-based fusion, achieving state-of-the-art results on the nuScenes and Argoverse2 autonomous-driving datasets.

Contribution

It introduces a flexible, query-based fusion mechanism that integrates modality-specific object semantics for improved multi-modal 3D detection.

Findings

01. Achieves state-of-the-art performance on nuScenes and Argoverse2 datasets.

02. Excels in long-range detection scenarios.

03. Demonstrates flexibility to integrate with various detectors.

Abstract

The rise of autonomous vehicles has significantly increased the demand for robust 3D object detection systems. While cameras and LiDAR sensors each offer unique advantages--cameras provide rich texture information and LiDAR offers precise 3D spatial data--relying on a single modality often leads to performance limitations. This paper introduces MV2DFusion, a multi-modal detection framework that integrates the strengths of both worlds through an advanced query-based fusion mechanism. By introducing an image query generator to align with image-specific attributes and a point cloud query generator, MV2DFusion effectively combines modality-specific object semantics without biasing toward one single modality. Then the sparse fusion process can be accomplished based on the valuable object semantics, ensuring efficient and accurate object detection across various scenarios. Our framework's…
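The query-based fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`cross_attention`, `fuse_queries`), the feature dimension, the random inputs, and the single unprojected attention step are all assumptions made for illustration. It only shows the general pattern of keeping two modality-specific query sets side by side and letting each query attend to features from both sensors.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, features):
    # Single-head scaled dot-product attention without learned projections
    # (a simplification; a real decoder layer would learn Q/K/V weights).
    d = queries.shape[-1]
    weights = softmax(queries @ features.T / np.sqrt(d), axis=-1)
    return weights @ features, weights

def fuse_queries(image_queries, point_queries, image_feats, point_feats):
    # Hypothetical fusion step: concatenate both query sets so that neither
    # modality's proposals are derived from the other's.
    queries = np.concatenate([image_queries, point_queries], axis=0)
    # Every query gathers evidence from both modalities.
    img_out, _ = cross_attention(queries, image_feats)
    pts_out, _ = cross_attention(queries, point_feats)
    # Residual update of the combined query set.
    return queries + img_out + pts_out

# Toy inputs standing in for real generator outputs (assumed shapes).
rng = np.random.default_rng(0)
d = 16
image_queries = rng.normal(size=(5, d))   # e.g. from 2D detections
point_queries = rng.normal(size=(7, d))   # e.g. from a LiDAR detector
image_feats = rng.normal(size=(40, d))
point_feats = rng.normal(size=(60, d))

fused = fuse_queries(image_queries, point_queries, image_feats, point_feats)
print(fused.shape)  # (12, 16)
```

Because the fused set holds only as many entries as there are object queries, the cost of this step scales with the number of candidate objects rather than with a dense grid, which is the sense in which such fusion is sparse.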

Taxonomy

Topics: Image Processing and 3D Reconstruction · Handwritten Text Recognition Techniques · 3D Shape Modeling and Analysis

Methods: ALIGN