HeightFormer: A Semantic Alignment Monocular 3D Object Detection Method from Roadside Perspective

Pei Liu (1), Zihao Zhang (2), Haipeng Liu (3), Nanfang Zheng (4), Meixin Zhu (1), Ziyuan Pu (4) ((1) Intelligent Transportation Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou); (2) School of Cyber Science and Engineering, Southeast University; (3) Li Auto Inc; (4) School of Transportation, Southeast University)

arXiv:2410.07758 · cs.CV · October 22, 2024

TL;DR

HeightFormer is a monocular 3D object detection framework for roadside sensors. It improves detection accuracy through better height alignment and more efficient bird's-eye-view feature extraction, with the aim of supporting safer autonomous driving.

Contribution

The paper presents a new framework integrating Spatial Former and Voxel Pooling Former for improved 3D detection from roadside perspectives, addressing height alignment and feature extraction efficiency.

Findings

01. Outperforms existing methods on Rope3D and DAIR-V2X-I datasets.

02. Demonstrates robustness across various detection scenarios.

03. Enhances 3D detection accuracy for vehicles and cyclists.

Abstract

On-board 3D object detection has received extensive attention as a critical technology for autonomous driving, while few studies have focused on applying roadside sensors to 3D traffic object detection. Existing studies project 2D image features to 3D features through frustum-based height estimation. However, they do not consider height alignment or the extraction efficiency of bird's-eye-view features. We propose a novel 3D object detection framework integrating Spatial Former and Voxel Pooling Former to enhance 2D-to-3D projection based on height estimation. Extensive experiments were conducted on the Rope3D and DAIR-V2X-I datasets, and the results demonstrate that the proposed algorithm outperforms existing methods in the detection of both vehicles and cyclists. These results indicate that the algorithm is robust and generalized under various…
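
The abstract refers to projecting 2D image features into 3D through height estimation. The geometric idea behind that step can be illustrated with a minimal sketch: a pixel defines a viewing ray, and an estimated height above the ground selects one point on that ray. This is not the paper's Spatial Former or Voxel Pooling Former, only plain pinhole geometry under stated assumptions. The function name, its parameters, the world frame (ground at z = 0, z pointing up), and the example camera values are all hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_with_height(
    u: float,
    v: float,
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    cam_to_world_rot: Sequence[Sequence[float]],
    cam_center: Vec3,
    height: float,
) -> Vec3:
    """Intersect the viewing ray of pixel (u, v) with the plane z = height.

    Assumptions (illustrative, not taken from the paper):
    - pinhole camera without lens distortion,
    - camera frame: x right, y down, z forward,
    - world frame: ground plane at z = 0, z pointing up,
    - cam_to_world_rot is a 3x3 rotation from camera to world axes.
    """
    # Ray direction in the camera frame for this pixel.
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)

    # Rotate the ray direction into the world frame.
    d_world = tuple(
        sum(cam_to_world_rot[i][j] * d_cam[j] for j in range(3))
        for i in range(3)
    )

    if abs(d_world[2]) < 1e-9:
        raise ValueError("ray is parallel to the height plane")

    # Solve cam_center.z + t * d_world.z == height for the ray parameter t.
    t = (height - cam_center[2]) / d_world[2]
    if t <= 0:
        raise ValueError("height plane lies behind the camera")

    return tuple(cam_center[i] + t * d_world[i] for i in range(3))


# Hypothetical roadside camera mounted 6 m above the ground, looking straight down.
ROT_DOWN = [[1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [0.0, 0.0, -1.0]]
CENTER = (0.0, 0.0, 6.0)

# The same pixel lands at different horizontal positions for different heights.
p_ground = backproject_with_height(1160, 540, 1000, 1000, 960, 540, ROT_DOWN, CENTER, 0.0)
p_raised = backproject_with_height(1160, 540, 1000, 1000, 960, 540, ROT_DOWN, CENTER, 1.0)
print(p_ground, p_raised)
```

In this sketch, an error in the estimated height shifts the recovered point along the viewing ray, which gives some intuition for why height estimation matters in the 2D-to-3D projection.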

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Image Processing and 3D Reconstruction

Methods: Softmax · Attention Is All You Need