Vehicle-centric Perception via Multimodal Structured Pre-training
Wentao Wu, Xiao Wang, Chenglong Li, Jin Tang, Bin Luo

TL;DR
This paper introduces VehicleMAE-V2, a large vehicle-centric pre-trained model that uses multimodal structured priors to improve vehicle perception representations and reports superior performance on five downstream tasks.
Contribution
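The paper proposes VehicleMAE-V2, which incorporates symmetry, contour, and semantic priors into masked token reconstruction through three modules: the Symmetry-guided Mask Module (SMM), the Contour-guided Representation Module (CRM), and the Semantics-guided Representation Module (SRM).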
Findings
VehicleMAE-V2 outperforms existing methods on five downstream tasks.
Symmetry, contour, and semantic priors guide masked token reconstruction, which improves representation learning.
The authors construct a large-scale dataset with 4 million images and text descriptions.
Abstract
Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of…
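The abstract is cut off before it explains how the Symmetry-guided Mask Module works, so the following is only a rough illustration of what masking guided by a symmetry prior could look like, not the authors' method. It assumes a left-right mirror symmetry over a patch grid and a hypothetical rule that a visible patch and its mirror are never both kept. The function name `symmetry_aware_mask`, the 14x14 grid, and the 0.75 mask ratio are all assumptions chosen for the example.

```python
import random


def symmetry_aware_mask(grid_h, grid_w, mask_ratio=0.75, seed=0):
    """Choose visible patches so that no visible patch has its
    left-right mirror visible as well (hypothetical rule, for illustration).

    Returns (visible, masked) as lists of (row, col) patch indices.
    """
    rng = random.Random(seed)
    all_patches = [(r, c) for r in range(grid_h) for c in range(grid_w)]
    n_visible = int(round(len(all_patches) * (1.0 - mask_ratio)))

    # Visit patches in random order, as plain random masking would.
    candidates = all_patches[:]
    rng.shuffle(candidates)

    visible = set()
    for r, c in candidates:
        if len(visible) == n_visible:
            break
        mirror = (r, grid_w - 1 - c)
        if mirror in visible:
            # The mirrored patch is already kept; keeping this one too
            # would be redundant under the assumed symmetry prior.
            continue
        visible.add((r, c))

    masked = [p for p in all_patches if p not in visible]
    return sorted(visible), masked


if __name__ == "__main__":
    visible, masked = symmetry_aware_mask(14, 14, mask_ratio=0.75, seed=0)
    print(len(visible), "visible patches,", len(masked), "masked patches")
```

Compared with uniform random masking, this variant spends the visible-patch budget on non-mirrored regions, so the encoder sees less redundant input. Whether the paper's SMM applies the constraint this way cannot be confirmed from the truncated abstract.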
Taxonomy
Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
