End-to-End Video Object Detection with Spatial-Temporal Transformers
Lu He, Qianyu Zhou, Xiangtai Li, Li Niu, Guangliang Cheng, Xiao Li, Wenxuan Liu, Yunhai Tong, Lizhuang Ma, Liqing Zhang

TL;DR
TransVOD is a streamlined, end-to-end video object detection model built on spatial-temporal Transformers. It eliminates hand-crafted feature-aggregation components and post-processing, and improves the Deformable DETR baseline by 3-4% mAP on the ImageNet VID dataset.
Contribution
The paper presents TransVOD, a novel end-to-end VOD model based on spatial-temporal Transformers that simplifies the pipeline and improves accuracy without hand-crafted components.
Findings
Boosts the Deformable DETR baseline by 3-4% mAP on ImageNet VID
Achieves performance comparable to state-of-the-art methods
Removes need for optical flow and post-processing in VOD
Abstract
Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while achieving performance on par with previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow, recurrent neural networks, and relation networks. Moreover, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present temporal Transformer to…
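The abstract is cut off before it describes the temporal Transformer, so the following is only a generic illustration of how per-frame object queries could be fused across frames with attention instead of optical flow or recurrent networks. It is not the authors' implementation: the function names `softmax` and `aggregate_queries` are hypothetical, and the sketch assumes plain dot-product attention with no learned projections.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_queries(current, reference):
    """Fuse current-frame object queries with reference-frame queries.

    current:   list of query vectors for the frame being detected.
    reference: list of query vectors collected from neighbouring frames.
    Returns one fused vector per current query (attention context plus
    a residual connection). Hypothetical helper, for illustration only.
    """
    dim = len(current[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current:
        # Scaled dot-product similarity between this query and every reference query.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in reference]
        weights = softmax(scores)
        # Attention-weighted average of the reference queries.
        context = [sum(w * k[i] for w, k in zip(weights, reference)) for i in range(dim)]
        # Residual connection keeps the original query information.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused

# Usage: one current-frame query attends to two queries from neighbouring frames.
current = [[1.0, 0.0]]
reference = [[1.0, 0.0], [0.0, 1.0]]
fused = aggregate_queries(current, reference)
```

Because each fused query still maps to a single detection, a set-based detector of this kind can in principle emit one box per query without a separate sequence-level rescoring step, which is the simplification the abstract attributes to the object query design.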
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Deformable Attention Module · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Deformable DETR · Feedforward Network · Convolution · Residual Connection · Adam
