TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers

Qianyu Zhou; Xiangtai Li; Lu He; Yibo Yang; Guangliang Cheng; Yunhai Tong; Lizhuang Ma; Dacheng Tao

arXiv:2201.05047 · cs.CV · November 23, 2022

TL;DR

TransVOD introduces an end-to-end video object detection system based on spatial-temporal Transformers, eliminating many traditional components and achieving state-of-the-art accuracy and speed on the ImageNet VID dataset.

Contribution

This paper presents the first end-to-end VOD system built on spatial-temporal Transformers. It removes hand-crafted components and introduces two improved models, TransVOD++ and TransVOD Lite.

Findings

01. Boosts the Deformable DETR baseline by 3-4% mAP on ImageNet VID

02. TransVOD++ achieves 90.0% mAP, setting a new state of the art

03. TransVOD Lite balances speed (30 FPS) and accuracy (83.7% mAP)

Abstract

Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing as well as earlier, more complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, removing the need for many hand-crafted feature-aggregation components, e.g., optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature…
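The idea of aggregating object queries across frames can be illustrated with a small sketch. This is not the authors' temporal Transformer, whose details are not given in this excerpt; it is a single-head dot-product attention with no learned projections, in which each object query of the current frame attends to queries gathered from reference frames. The function names `aggregate_queries`, `softmax`, and `dot` are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_queries(current_queries, reference_queries):
    """Fuse each current-frame query with queries from reference frames.

    Illustrative assumption: plain scaled dot-product attention plus a
    residual connection, without the learned projections a real
    Transformer layer would use.

    current_queries: list of d-dim vectors for the current frame.
    reference_queries: list of d-dim vectors from neighbouring frames.
    Returns one fused d-dim vector per current-frame query.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current_queries:
        # Attention weights of this query over all reference queries.
        weights = softmax([dot(q, r) * scale for r in reference_queries])
        # Weighted sum of the reference queries (temporal context).
        context = [
            sum(w * r[i] for w, r in zip(weights, reference_queries))
            for i in range(dim)
        ]
        # Residual connection keeps the original query content.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused
```

Because every fused query still corresponds to one object slot, a set-prediction detector of this kind can emit one box per query, which is consistent with the abstract's statement that sequence-level post-processing such as Seq-NMS is not needed.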

Taxonomy

Methods: Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Feedforward Network · Dense Connections · Position-Wise Feed-Forward Layer · Multi-Head Attention · Detection Transformer · Convolution · Softmax