End-to-End Video Object Detection with Spatial-Temporal Transformers

Lu He; Qianyu Zhou; Xiangtai Li; Li Niu; Guangliang Cheng; Xiao Li; Wenxuan Liu; Yunhai Tong; Lizhuang Ma; Liqing Zhang

arXiv:2105.10920·cs.CV·May 25, 2021·6 cites

TL;DR

TransVOD is a streamlined, end-to-end video object detection model built on spatial-temporal Transformers. It removes hand-crafted feature-aggregation components and post-processing, and improves the Deformable DETR baseline by 3-4% mAP on the ImageNet VID dataset.

Contribution

The paper presents TransVOD, an end-to-end video object detection (VOD) model based on spatial-temporal Transformers that simplifies the pipeline and improves accuracy without hand-crafted components.

Findings

1. Boosts the Deformable DETR baseline by 3-4% mAP on ImageNet VID

2. Achieves performance comparable to state-of-the-art methods

3. Removes the need for optical flow and post-processing in VOD

Abstract

Recently, DETR and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while demonstrating performance on par with previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, an end-to-end video object detection model based on a spatial-temporal Transformer architecture. The goal of this paper is to streamline the pipeline of VOD, effectively removing the need for many hand-crafted components for feature aggregation, e.g., optical flow, recurrent neural networks, and relation networks. In addition, benefiting from the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS or Tubelet rescoring, which keeps the pipeline simple and clean. In particular, we present a temporal Transformer to…
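The abstract is truncated before it describes the temporal Transformer, so the following is only an illustrative sketch of the general idea of aggregating DETR-style object queries across frames with attention, not the authors' implementation. It assumes plain scaled dot-product attention with no learned projections and a residual connection; the function name `aggregate_queries` and the toy vectors are hypothetical. It uses only the Python standard library so it runs without PyTorch.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_queries(
    current: List[Vector], references: List[List[Vector]]
) -> List[Vector]:
    """Let each current-frame query attend over all reference-frame queries.

    Assumption: keys and values are the raw reference queries (no learned
    projections), and the attended vector is added back to the query.
    """
    # Flatten the per-frame query lists into one temporal memory.
    memory = [q for frame in references for q in frame]
    dim = len(current[0])
    out = []
    for q in current:
        # Scaled dot-product attention weights over the temporal memory.
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in memory])
        attended = [
            sum(w * k[i] for w, k in zip(weights, memory)) for i in range(dim)
        ]
        # Residual connection keeps the current-frame information.
        out.append([a + b for a, b in zip(q, attended)])
    return out


# Toy usage: one current-frame query, two reference frames with one query each.
current_frame = [[1.0, 0.0]]
reference_frames = [[[0.0, 2.0]], [[0.0, 2.0]]]
print(aggregate_queries(current_frame, reference_frames))  # [[1.0, 2.0]]
```

Because both reference queries are identical here, the attention weights are uniform and the output is simply the current query plus that shared reference vector. A real model would add learned projections, multiple heads, and a detection head on top of the aggregated queries.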

Code & Models

Repositories

SJTU-LuHe/TransVOD
PyTorch · Official

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Deformable Attention Module · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Deformable DETR · Feedforward Network · Convolution · Residual Connection · Adam