VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models
Wentao Wu, Fanghua Hong, Xiao Wang, Chenglong Li, Jin Tang

TL;DR
VFM-Det is a vehicle detection method that combines a pre-trained vehicle foundation model (VehicleMAE) with a large language model (T5), improving detection accuracy by aligning the semantic attributes of vehicles with their visual features.
Contribution
The paper proposes VFM-Det, a new vehicle detection framework that integrates a pre-trained vehicle model and semantic attribute prediction to improve detection performance.
Findings
Improved AP_{0.5} by +5.1% on Cityscapes
Improved AP_{0.75} by +6.2% on Cityscapes
Demonstrated effectiveness across three benchmark datasets
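For readers unfamiliar with the metrics above: AP_{0.5} and AP_{0.75} are average precision scores in which a predicted box counts as correct only if its intersection over union (IoU) with a ground-truth box reaches 0.5 or 0.75, respectively. The following is a minimal sketch of that matching criterion, not the paper's evaluation code; the function names and the example boxes are hypothetical and chosen only for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold):
    """A prediction matches a ground-truth box if IoU >= threshold."""
    return iou(pred, gt) >= threshold


# Hypothetical boxes: the prediction is shifted 20 px to the right.
gt = (0, 0, 100, 100)
pred = (20, 0, 120, 100)

print(round(iou(pred, gt), 3))            # IoU of the two boxes
print(is_true_positive(pred, gt, 0.5))    # criterion used by AP_{0.5}
print(is_true_positive(pred, gt, 0.75))   # stricter criterion used by AP_{0.75}
```

In this example the same prediction passes the 0.5 threshold but fails the 0.75 threshold, which is why AP_{0.75} rewards tighter localization than AP_{0.5}.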
Abstract
Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images based on a pre-trained backbone (e.g., ResNet, ViT). Some researchers also exploit pre-trained large foundation models to enhance detection performance. However, we think these detectors may only get sub-optimal results because the large models they use are not specifically designed for vehicles. In addition, their results rely heavily on visual features, and they seldom consider the alignment between the vehicle's semantic information and visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework and the features of each proposal can be enhanced using…
Taxonomy
Topics: Advanced Neural Network Applications · Automated Road and Building Extraction · Video Surveillance and Tracking Methods
Methods: Attention Is All You Need · Average Pooling · Linear Layer · Adam · Layer Normalization · Feedforward Network · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention
