Increasing the Efficiency of DETR for Maritime High-Resolution Images
Tinsae Yehuala, Hao Cheng, Ville Lehtola

TL;DR
This paper enhances DETR's efficiency for maritime high-resolution images by integrating Vision Mamba backbones, token pruning, and a tailored feature pyramid network, enabling accurate real-time object detection on resource-limited platforms.
Contribution
It introduces a novel combination of Vision Mamba backbones and optimized network design to improve detection accuracy and efficiency for high-resolution maritime imagery.
Findings
Outperforms RT-DETR with ResNet50 in accuracy and efficiency
Achieves real-time detection on high-resolution maritime images
Reduces computational load via token pruning and a tailored feature pyramid network
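To make the token pruning idea concrete, here is a minimal sketch of score-based top-k pruning. This summary does not describe the paper's actual pruning criterion, so the per-token `scores`, the `keep_ratio` parameter, and the `prune_tokens` helper are all hypothetical and serve only as illustration.

```python
import math
from typing import List, Sequence, Tuple


def prune_tokens(
    tokens: Sequence[str],
    scores: Sequence[float],
    keep_ratio: float,
) -> Tuple[List[str], List[int]]:
    """Keep the highest-scoring fraction of tokens, preserving sequence order.

    Hypothetical helper: the importance scores are assumed to come from
    some upstream module; the paper's own criterion may differ.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not tokens:
        return [], []

    # Always keep at least one token.
    n_keep = max(1, math.ceil(len(tokens) * keep_ratio))

    # Indices of the n_keep highest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]

    # Restore the original ordering so the pruned sequence stays in scan order.
    kept = sorted(top)
    return [tokens[i] for i in kept], kept


if __name__ == "__main__":
    patches = ["sky", "buoy", "water", "vessel"]
    importance = [0.1, 0.9, 0.2, 0.8]
    print(prune_tokens(patches, importance, keep_ratio=0.5))
```

Keeping the surviving tokens in their original order matters for sequence models, since the downstream layers still see a coherent (if shorter) scan of the image.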
Abstract
Maritime object detection is critical for the safe navigation of unmanned surface vessels (USVs), requiring accurate recognition of obstacles from small buoys to large vessels. Real-time detection is challenging due to long distances, small object sizes, large-scale variations, edge computing limitations, and the high memory demands of high-resolution imagery. Existing solutions, such as downsampling or image splitting, often reduce accuracy or require additional processing, while memory-efficient models typically handle only limited resolutions. To overcome these limitations, we leverage Vision Mamba (ViM) backbones, which build on State Space Models (SSMs) to capture long-range dependencies while scaling linearly with sequence length. Images are tokenized into sequences for efficient high-resolution processing. For further computational efficiency, we design a tailored Feature Pyramid…
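A rough sketch of why linear scaling matters for high-resolution input: the number of tokens grows with image area, and self-attention cost grows quadratically in that number, while a model that scales linearly with sequence length grows only in proportion. The patch size of 16 and the helper names below are assumptions for illustration, not values taken from the paper.

```python
from typing import Tuple


def num_tokens(height: int, width: int, patch_size: int = 16) -> int:
    """Count non-overlapping patches, assuming the image is padded up
    to a multiple of patch_size (patch_size=16 is an assumed value)."""
    rows = -(-height // patch_size)  # ceiling division
    cols = -(-width // patch_size)
    return rows * cols


def relative_cost(n_tokens: int, base_tokens: int) -> Tuple[float, float]:
    """Return (linear, quadratic) cost growth relative to a baseline length.

    Linear models a sequence model whose cost is proportional to length;
    quadratic models pairwise self-attention over all tokens.
    """
    ratio = n_tokens / base_tokens
    return ratio, ratio ** 2


if __name__ == "__main__":
    base = num_tokens(224, 224)      # a common low-resolution input size
    full_hd = num_tokens(1080, 1920)  # a Full HD frame
    linear, quadratic = relative_cost(full_hd, base)
    print(f"tokens: {base} -> {full_hd}")
    print(f"linear cost x{linear:.1f}, quadratic cost x{quadratic:.1f}")
```

Under these assumptions, moving from a 224x224 input to a Full HD frame multiplies the token count by roughly 40, which is why quadratic-cost layers become the bottleneck long before linear-cost ones do.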
