TL;DR
TransVOD introduces an end-to-end video object detection system based on spatial-temporal Transformers, eliminating many traditional components and achieving state-of-the-art accuracy and speed on the ImageNet VID dataset.
Contribution
This paper presents the first end-to-end VOD system built on spatial-temporal Transformers. It removes hand-crafted components and introduces two improved variants, TransVOD++ and TransVOD Lite.
Findings
Improves the Deformable DETR baseline by 3-4% mAP on ImageNet VID
TransVOD++ achieves 90.0% mAP, setting a new state of the art on ImageNet VID
TransVOD Lite balances speed (30 FPS) and accuracy (83.7% mAP)
Abstract
Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance as good as that of previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow models and relation networks. Besides, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature…
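The idea of aggregating object queries across frames can be illustrated with a small sketch. This is not the TransVOD implementation: it assumes a single attention head, no learned projections, and a plain residual connection, and the names `aggregate_queries`, `current`, and `references` are hypothetical. It only shows how queries from the current frame could attend over queries pooled from neighbouring frames using scaled dot-product attention.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_queries(current, references):
    """Fuse each current-frame query with reference-frame queries.

    current:    list of query vectors for the frame being detected
    references: flat list of query vectors pooled from neighbouring frames
    Returns one fused vector per current query (attention output + residual).

    Assumption: no learned weights; a real temporal Transformer would
    project queries, keys, and values and use several heads.
    """
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current:
        # Similarity of this query to every reference query.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in references]
        weights = softmax(scores)
        # Weighted sum of reference queries (keys double as values here).
        context = [
            sum(w * k[i] for w, k in zip(weights, references))
            for i in range(dim)
        ]
        # Residual connection keeps the original query information.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Toy example: two current-frame queries, three reference-frame queries.
current = [[1.0, 0.0], [0.0, 1.0]]
references = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = aggregate_queries(current, references)
```

Each fused query leans toward the reference queries it most resembles, which is the intuition behind letting detections in one frame borrow evidence from nearby frames without optical flow or Seq-NMS.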
Taxonomy
Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Feedforward Network · Dense Connections · Position-Wise Feed-Forward Layer · Multi-Head Attention · Detection Transformer · Convolution · Softmax
