Progressive Sparse Local Attention for Video Object Detection
Chaoxu Guo, Bin Fan, Jie Gu, Qian Zhang, Shiming Xiang, Veronique Prinet, Chunhong Pan

TL;DR
This paper introduces Progressive Sparse Local Attention (PSLA), a novel local attention module that establishes spatial correspondence between features across frames without optical flow, improving the trade-off between accuracy and efficiency in video object detection.
Contribution
The paper proposes PSLA, RFU, and DenseFT, novel modules that enhance feature propagation and representation in video object detection without relying on optical flow.
Findings
Achieves state-of-the-art accuracy on ImageNet VID
Has a smaller model size than flow-based methods
Maintains acceptable runtime speed
Abstract
Transferring image-based object detectors to the domain of videos remains a challenging problem. Previous efforts mostly exploit optical flow to propagate features across frames, aiming to achieve a good trade-off between accuracy and efficiency. However, introducing an extra model to estimate optical flow can significantly increase the overall model size. The gap between optical flow and high-level features can also hinder it from establishing spatial correspondence accurately. Instead of relying on optical flow, this paper proposes a novel module called Progressive Sparse Local Attention (PSLA), which establishes the spatial correspondence between features across frames in a local region with progressively sparser stride and uses the correspondence to propagate features. Based on PSLA, Recursive Feature Updating (RFU) and Dense Feature Transforming (DenseFT) are proposed to model…
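The correspondence-and-propagation idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the offset pattern (eight neighbours per ring, with ring radii 1, 3, 6, ...), the dot-product similarity, the softmax normalization, and the names `progressive_offsets` and `psla_propagate` are all assumptions made for illustration. It only shows the general shape of attending over a local neighbourhood that is sampled more sparsely as distance grows, then using the attention weights to propagate features.

```python
import numpy as np


def progressive_offsets(num_levels=3):
    """Hypothetical sampling pattern: the centre plus 8 neighbours per ring.

    The gap between consecutive rings grows by 1, 2, 3, ..., so sampling
    becomes sparser farther from the centre (an assumption, not the paper's grid).
    """
    offsets = [(0, 0)]
    radius = 0
    for stride in range(1, num_levels + 1):
        radius += stride
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if (dy, dx) != (0, 0):
                    offsets.append((dy * radius, dx * radius))
    return offsets


def psla_propagate(src, tgt, num_levels=3):
    """Align `src` features (C, H, W) to `tgt` features (C, H, W).

    For each target position, dot-product similarities are computed against
    source features at the sparse local offsets, normalized with a softmax,
    and used as weights to aggregate the source features.
    """
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    _, h, w = src.shape
    offsets = progressive_offsets(num_levels)
    p = max(abs(dy) for dy, _ in offsets)  # padding needed for the widest ring

    src_pad = np.pad(src, ((0, 0), (p, p), (p, p)))
    valid_pad = np.pad(np.ones((h, w), dtype=bool), p)  # False outside the map

    # neigh[k, :, y, x] is the source feature at (y + dy_k, x + dx_k).
    neigh = np.stack([src_pad[:, p + dy:p + dy + h, p + dx:p + dx + w]
                      for dy, dx in offsets])            # (K, C, H, W)
    valid = np.stack([valid_pad[p + dy:p + dy + h, p + dx:p + dx + w]
                      for dy, dx in offsets])            # (K, H, W)

    scores = (neigh * tgt[None]).sum(axis=1)             # (K, H, W)
    scores = np.where(valid, scores, -np.inf)            # ignore out-of-map samples
    scores = scores - scores.max(axis=0, keepdims=True)  # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=0, keepdims=True)

    aligned = (weights[:, None] * neigh).sum(axis=0)     # (C, H, W)
    return aligned, weights
```

With `num_levels=3` the sketch compares each target position against 25 source positions rather than every position in a dense 13x13 window, which is the kind of saving a sparse neighbourhood is meant to provide. Out-of-map samples are masked before the softmax so that zero padding does not receive attention weight.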
Taxonomy
Topics: Advanced Neural Network Applications · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
