Vehicle-centric Perception via Multimodal Structured Pre-training

Wentao Wu; Xiao Wang; Chenglong Li; Jin Tang; Bin Luo

arXiv:2512.19934 · cs.CV · December 24, 2025

TL;DR

This paper introduces VehicleMAE-V2, a large vehicle-centric pre-trained model that uses multimodal structured priors (symmetry, contour, and semantics) to guide masked token reconstruction, and reports better performance than existing methods on five downstream tasks.

Contribution

The paper proposes VehicleMAE-V2, incorporating symmetry, contour, and semantic priors into masked token reconstruction for enhanced vehicle perception.

Findings

01. Outperforms existing methods on five downstream tasks.

02. Effectively utilizes multimodal priors for better representation learning.

03. Constructs a large-scale dataset with 4 million images and text descriptions.

Abstract

Vehicle-centric perception plays a crucial role in many intelligent systems, including large-scale surveillance systems, intelligent transportation, and autonomous driving. Existing approaches lack effective learning of vehicle-related knowledge during pre-training, resulting in poor capability for modeling general vehicle perception representations. To handle this problem, we propose VehicleMAE-V2, a novel vehicle-centric pre-trained large model. By exploring and exploiting vehicle-related multimodal structured priors to guide the masked token reconstruction process, our approach can significantly enhance the model's capability to learn generalizable representations for vehicle-centric perception. Specifically, we design the Symmetry-guided Mask Module (SMM), Contour-guided Representation Module (CRM) and Semantics-guided Representation Module (SRM) to incorporate three kinds of…
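
For readers unfamiliar with masked token reconstruction, the following is a minimal sketch of the generic random patch masking used in masked-autoencoder-style pre-training. It is not the paper's Symmetry-guided Mask Module, whose details are not given in the excerpt above; the function name `random_patch_mask`, the patch count of 196, and the mask ratio of 0.75 are illustrative assumptions.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into visible and masked sets, uniformly at random.

    Generic illustration only: the excerpt says the paper guides mask
    selection with a symmetry prior, which this sketch does not attempt.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])   # patches hidden from the encoder
    visible = sorted(order[num_masked:])  # patches the encoder sees
    return visible, masked


# Assumed setting: a 14 x 14 patch grid (196 patches), 75% masked.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

In this kind of setup the encoder would process only the `visible` patches, and a decoder would be trained to reconstruct the `masked` ones. A prior-guided variant would replace the uniform shuffle with a selection rule informed by the prior.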

Taxonomy

Topics: Advanced Neural Network Applications · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications