VFM-Det: Towards High-Performance Vehicle Detection via Large Foundation Models

Wentao Wu; Fanghua Hong; Xiao Wang; Chenglong Li; Jin Tang

arXiv:2408.13031·cs.CV·August 26, 2024

Open Access · 1 Repo

TL;DR

VFM-Det is a vehicle detection method that combines a pre-trained vehicle foundation model (VehicleMAE) with a large language model (T5), improving detection accuracy by aligning semantic attributes with visual features.

Contribution

The paper proposes VFM-Det, a new vehicle detection framework that integrates a pre-trained vehicle model and semantic attribute prediction to improve detection performance.

Findings

01 Improved AP_{0.5} by +5.1% on Cityscapes

02 Improved AP_{0.75} by +6.2% on Cityscapes

03 Demonstrated effectiveness across three benchmark datasets
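
For readers unfamiliar with the metric names: AP_{0.5} and AP_{0.75} conventionally denote average precision where a predicted box counts as correct only if its intersection-over-union (IoU) with a ground-truth box is at least 0.5 or 0.75, respectively. The following is a minimal sketch of that matching criterion, assuming the standard convention; it is not taken from the paper's code, and the helper names and box coordinates are hypothetical.

```python
from typing import Tuple

# Axis-aligned box as (x1, y1, x2, y2); hypothetical convention for this sketch.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred: Box, gt: Box, threshold: float) -> bool:
    """A prediction matches a ground-truth box if IoU >= threshold."""
    return iou(pred, gt) >= threshold


# Illustrative boxes: the prediction is shifted 2 units right of the ground truth.
gt_box: Box = (0.0, 0.0, 10.0, 10.0)
pred_box: Box = (2.0, 0.0, 12.0, 10.0)

print(round(iou(pred_box, gt_box), 3))            # 80 / 120
print(is_true_positive(pred_box, gt_box, 0.5))    # counted under AP_{0.5}
print(is_true_positive(pred_box, gt_box, 0.75))   # rejected under AP_{0.75}
```

The same prediction can count as a hit at the 0.5 threshold and a miss at 0.75, which is why AP_{0.75} rewards tighter localization and is typically the lower of the two numbers.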

Abstract

Existing vehicle detectors are usually obtained by training a typical detector (e.g., YOLO, RCNN, DETR series) on vehicle images, starting from a pre-trained backbone (e.g., ResNet, ViT). Some researchers also exploit pre-trained large foundation models to enhance detection performance. However, we argue that these detectors may achieve only sub-optimal results because the large models they use are not specifically designed for vehicles. In addition, their results rely heavily on visual features, and they seldom consider the alignment between the vehicle's semantic information and its visual representations. In this work, we propose a new vehicle detection paradigm based on a pre-trained foundation vehicle model (VehicleMAE) and a large language model (T5), termed VFM-Det. It follows the region proposal-based detection framework and the features of each proposal can be enhanced using…

Code & Models

Repositories

event-ahu/vfm-det
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Automated Road and Building Extraction · Video Surveillance and Tracking Methods

Methods: Attention Is All You Need · Average Pooling · Linear Layer · Adam · Layer Normalization · Feedforward Network · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Multi-Head Attention