Aerial Image Object Detection With Vision Transformer Detector (ViTDet)
Liya Wang, Alex Tien

TL;DR
This paper evaluates the Vision Transformer Detector (ViTDet) for aerial image object detection, showing that it outperforms CNN-based methods across multiple datasets and remains competitive for oriented bounding box detection.
Contribution
The paper presents the first comprehensive study of ViTDet applied to aerial images, shows that it consistently outperforms its CNN counterparts, and establishes a baseline for future research.
Findings
ViTDet outperforms CNN counterparts by up to 17% in average precision.
ViTDet achieves competitive results for oriented bounding box detection.
The study offers a baseline for future aerial image detection research.
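The findings distinguish axis-aligned (horizontal) boxes from oriented ones. As a minimal sketch of why that distinction matters in aerial imagery, the snippet below converts an oriented box into its corner points and then into the tightest enclosing horizontal box. The `(cx, cy, w, h, angle)` parameterization and the function names `obb_to_corners` and `enclosing_hbb` are assumptions made for illustration; they are not taken from the paper or from any particular detection library.

```python
import math


def obb_to_corners(cx, cy, w, h, angle_rad):
    """Corners of a w-by-h box centered at (cx, cy), rotated by angle_rad.

    Hypothetical helper: the (cx, cy, w, h, angle) convention is an assumption.
    """
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    half_extents = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    # Rotate each corner offset about the center, then translate.
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in half_extents
    ]


def enclosing_hbb(corners):
    """Tightest axis-aligned box (x_min, y_min, x_max, y_max) around the corners."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), min(ys), max(xs), max(ys)


def hbb_area(box):
    """Area of an axis-aligned box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) * (y_max - y_min)


if __name__ == "__main__":
    # A 4-by-2 object rotated by 45 degrees, e.g. a vehicle seen from above.
    corners = obb_to_corners(0.0, 0.0, 4.0, 2.0, math.pi / 4)
    box = enclosing_hbb(corners)
    print("oriented area:", 4.0 * 2.0)
    print("enclosing horizontal area:", round(hbb_area(box), 3))
```

For a rotated, elongated object, the enclosing horizontal box covers noticeably more area than the object itself, which is one reason oriented boxes are often reported separately for aerial datasets.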
Abstract
The past few years have seen an increased interest in aerial image object detection due to its critical value to large-scale geo-scientific research like environmental studies, urban planning, and intelligence monitoring. However, the task is very challenging due to the bird's-eye-view perspective, complex backgrounds, large and varied image sizes, different appearances of objects, and the scarcity of well-annotated datasets. Recent advances in computer vision have shown promise in tackling the challenge. Specifically, Vision Transformer Detector (ViTDet) was proposed to extract multi-scale features for object detection. The empirical study shows that ViTDet's simple design achieves good performance on natural scene images and can be easily embedded into any detector architecture. To date, ViTDet's potential benefit to challenging aerial image object detection has not been explored.…
Taxonomy
Topics: Advanced Neural Network Applications · Infrared Target Detection Methodologies · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Softmax · Absolute Position Encodings · Byte Pair Encoding · Adam · Layer Normalization · Label Smoothing · Multi-Head Attention · Dense Connections
