TL;DR
This paper introduces Voxelized 3D Feature Aggregation (VFA), a novel method for multi-view detection that improves feature projection accuracy by considering object height and shape, leading to better detection performance.
Contribution
VFA voxelizes 3D space for feature aggregation, incorporating object height and shape through oriented Gaussian encoding, advancing multi-view detection accuracy and efficiency.
Findings
Outperforms state-of-the-art methods on multiple datasets
Effective in multi-view 2D and 3D detection tasks
Introduces the MultiviewC dataset for benchmarking
Abstract
Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, where state-of-the-art approaches adopt homography transformations to project multi-view features onto the ground plane. However, we find that these 2D transformations do not take the object's height into account. As a result, features along the vertical direction of the same object are likely not projected onto the same ground-plane point, leading to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a…
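The voxelize, project, and aggregate steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`make_voxel_centers`, `project_points`, `aggregate_view`), the 3x4 projection matrix `P`, nearest-pixel sampling, and the plain mean over each vertical column are all assumptions made for illustration. The paper's own aggregation, including its oriented Gaussian encoding, is not reproduced here.

```python
import numpy as np

def make_voxel_centers(nx, ny, nz, voxel_size):
    """Return voxel-center coordinates with shape (nx, ny, nz, 3); z is the vertical axis."""
    xs = (np.arange(nx) + 0.5) * voxel_size
    ys = (np.arange(ny) + 0.5) * voxel_size
    zs = (np.arange(nz) + 0.5) * voxel_size
    gx, gy, gz = np.meshgrid(xs, ys, zs, indexing="ij")
    return np.stack([gx, gy, gz], axis=-1)

def project_points(points, P):
    """Project 3D points with a 3x4 camera matrix P (assumed known from calibration)."""
    flat = points.reshape(-1, 3)
    homo = np.concatenate([flat, np.ones((flat.shape[0], 1))], axis=1)
    uvw = homo @ P.T
    depth = uvw[:, 2]
    # Avoid dividing by zero for points at or behind the camera; they are masked out later.
    safe = np.where(depth > 1e-6, depth, 1.0)
    uv = uvw[:, :2] / safe[:, None]
    lead = points.shape[:-1]
    return uv.reshape(lead + (2,)), depth.reshape(lead)

def aggregate_view(feat, voxels, P):
    """Sample an (H, W, C) feature map at each projected voxel, then average
    the samples of every vertical voxel column into one ground-plane feature."""
    H, W, C = feat.shape
    uv, depth = project_points(voxels, P)
    u = np.floor(uv[..., 0]).astype(int)  # pixel column containing the projection
    v = np.floor(uv[..., 1]).astype(int)  # pixel row containing the projection
    valid = (depth > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    sampled = np.zeros(voxels.shape[:-1] + (C,))
    sampled[valid] = feat[v[valid], u[valid]]
    count = valid.sum(axis=2)                      # visible voxels per column
    ground = sampled.sum(axis=2) / np.maximum(count, 1)[..., None]
    return ground, count
```

Running `aggregate_view` once per camera and combining the resulting ground-plane maps (for example, averaging them where `count` is non-zero) would give a multi-view feature map. That combination step is likewise an assumption, not a detail taken from the paper.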
