SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer

Wenxi Li; Yuchen Guo; Jilai Zheng; Haozhe Lin; Chao Ma; Lu Fang; Xiaokang Yang

arXiv:2502.07216·cs.CV·February 12, 2025

TL;DR

SparseFormer is a novel sparse vision transformer designed for object detection in high-resolution wide shots, effectively handling extreme sparsity and scale variations to improve accuracy and efficiency.

Contribution

It introduces a model-agnostic sparse transformer with selective attentive tokens, a cross-slice NMS algorithm, and a multi-scale strategy for HRW shot detection.
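To illustrate why detections from sliced HRW images need a merging step at all, here is a minimal sketch that maps per-slice boxes into full-image coordinates and applies ordinary greedy NMS. This is not the paper's cross-slice NMS algorithm, whose details are not given on this page; the names `to_global` and `merge_slices`, the box layout, and the 0.5 IoU threshold are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2, score).
Box = Tuple[float, float, float, float, float]
Offset = Tuple[float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def to_global(box: Box, offset: Offset) -> Box:
    """Shift a slice-local box by the slice's top-left offset."""
    ox, oy = offset
    x1, y1, x2, y2, score = box
    return (x1 + ox, y1 + oy, x2 + ox, y2 + oy, score)


def merge_slices(slices: List[Tuple[Offset, List[Box]]],
                 iou_thr: float = 0.5) -> List[Box]:
    """Plain greedy NMS over all slice detections in global coordinates."""
    boxes = [to_global(b, off) for off, dets in slices for b in dets]
    boxes.sort(key=lambda b: b[4], reverse=True)  # highest score first
    kept: List[Box] = []
    for b in boxes:
        # Keep a box only if it does not overlap a kept box too strongly.
        if all(iou(b, k) <= iou_thr for k in kept):
            kept.append(b)
    return kept


# Two overlapping slices see the same object near their shared border.
slices = [
    ((0, 0), [(80, 10, 120, 50, 0.9)]),
    ((60, 0), [(22, 11, 60, 50, 0.6), (100, 60, 130, 90, 0.8)]),
]
merged = merge_slices(slices)
print(merged)
```

In this toy input the lower-scoring duplicate from the second slice is suppressed, leaving one box per object. A plain IoU test like this can fail when a slice border truncates an object, which is presumably the kind of case a dedicated cross-slice scheme targets.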

Findings

01. Improves detection accuracy by up to 5.8% on benchmarks.

02. Achieves up to 3x faster detection speed.

03. Effectively handles extreme sparsity and scale changes.

Abstract

Recent years have seen an increase in the use of gigapixel-level image and video capture systems and benchmarks with high-resolution wide (HRW) shots. However, unlike close-up shots in the MS COCO dataset, the higher resolution and wider field of view raise unique challenges, such as extreme sparsity and huge scale changes, which make existing close-up detectors inaccurate and inefficient. In this paper, we present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap in object detection between close-up and HRW shots. The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects. In this way, it can jointly explore global and local attention by fusing coarse- and fine-grained features to handle huge scale changes. SparseFormer also benefits from a novel Cross-slice non-maximum…
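The idea of scrutinizing only the windows that may contain objects can be sketched as a top-k selection over per-window scores. This is a simplified illustration, not the authors' implementation: the function `select_windows`, the `keep_ratio` parameter, and the score grid are hypothetical, and how SparseFormer actually scores windows is not described on this page.

```python
from typing import List, Tuple


def select_windows(scores: List[List[float]],
                   keep_ratio: float = 0.25) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the highest-scoring windows.

    `scores` is a hypothetical grid with one objectness-like value per
    window; `keep_ratio` is the assumed fraction of windows to keep.
    """
    flat = [(s, r, c) for r, row in enumerate(scores) for c, s in enumerate(row)]
    k = max(1, round(len(flat) * keep_ratio))  # always keep at least one window
    top = sorted(flat, key=lambda t: t[0], reverse=True)[:k]
    return sorted((r, c) for _, r, c in top)


# A 2 x 4 grid of windows in which only two likely contain objects.
scores = [
    [0.02, 0.91, 0.05, 0.01],
    [0.03, 0.08, 0.77, 0.02],
]
selected = select_windows(scores, keep_ratio=0.25)
print(selected)  # [(0, 1), (1, 2)]
```

Only the selected windows would then receive the expensive fine-grained processing, while the remaining windows are covered by cheaper coarse features. The savings grow as the scene becomes sparser.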

Taxonomy

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings