Aerial Image Object Detection With Vision Transformer Detector (ViTDet)

Liya Wang; Alex Tien

arXiv:2301.12058 · cs.CV · February 3, 2023 · 1 citation

TL;DR

This paper evaluates the effectiveness of the Vision Transformer Detector (ViTDet) for aerial image object detection, demonstrating its superior performance over CNN-based methods across multiple datasets and bounding box types.

Contribution

It provides the first comprehensive study of ViTDet's application to aerial images, showing its consistent outperformance and establishing a baseline for future research.

Findings

01. ViTDet outperforms its CNN counterparts on horizontal bounding box (HBB) detection by up to 17% in average precision.

02. ViTDet achieves competitive results for oriented bounding box (OBB) detection.

03. The study offers a baseline for future aerial image detection research.

Abstract

The past few years have seen increased interest in aerial image object detection because of its critical value to large-scale geo-scientific research such as environmental studies, urban planning, and intelligence monitoring. However, the task is very challenging due to the bird's-eye-view perspective, complex backgrounds, large and varied image sizes, different appearances of objects, and the scarcity of well-annotated datasets. Recent advances in computer vision have shown promise in tackling the challenge. Specifically, the Vision Transformer Detector (ViTDet) was proposed to extract multi-scale features for object detection. The empirical study shows that ViTDet's simple design achieves good performance on natural scene images and can be easily embedded into any detector architecture. To date, ViTDet's potential benefit to challenging aerial image object detection has not been explored.…

Code & Models

Repositories

MS-P3/code7/tree/main/vitdet
mindspore

Taxonomy

Topics: Advanced Neural Network Applications · Infrared Target Detection Methodologies · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections