Real-time Online Video Detection with Temporal Smoothing Transformers

Yue Zhao; Philipp Krähenbühl

arXiv:2209.09236 · cs.CV · September 20, 2022 · 1 citation

TL;DR

This paper introduces TeSTra, a novel transformer-based model for real-time video recognition that efficiently captures long-term video dynamics using temporal smoothing kernels, achieving state-of-the-art results with constant computational overhead.

Contribution

The paper proposes a new temporal smoothing attention mechanism for transformers, enabling constant-time updates and improved long-term video modeling in streaming recognition.

Findings

01

TeSTra runs 6 times faster than traditional sliding-window transformers.

02

TeSTra achieves state-of-the-art results on the THUMOS'14 and EPIC-Kitchens-100 datasets.

03

Real-time TeSTra outperforms most prior methods on online action detection.

Abstract

Streaming video recognition reasons about objects and their actions in every frame of a video. A good streaming recognition model captures both the long-term dynamics and the short-term changes of a video. Unfortunately, in most existing methods, the computational complexity grows linearly or quadratically with the length of the considered dynamics. This issue is particularly pronounced in transformer-based architectures. To address it, we reformulate the cross-attention in a video transformer through the lens of kernels and apply one of two temporal smoothing kernels: a box kernel or a Laplace kernel. The resulting streaming attention reuses much of the computation from frame to frame and requires only a constant-time update per frame. Based on this idea, we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily long inputs with constant caching and computing…
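The constant-time update described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a fixed set of queries (so earlier kernel evaluations can be reused), an unnormalized exponential kernel exp(q·k / sqrt(D)), and an exponential decay in frame age standing in for the Laplace kernel. The names ExpSmoothedAttention, full_attention, and decay are hypothetical and chosen only for this example.

```python
import numpy as np


class ExpSmoothedAttention:
    """Streaming cross-attention sketch with exponential temporal decay.

    Assumption: the queries are fixed over time, so the running sums
    below can be reused from frame to frame.
    """

    def __init__(self, queries, decay):
        self.q = np.asarray(queries, dtype=np.float64)  # shape (M, D)
        self.gamma = float(np.exp(-decay))              # per-frame decay factor
        self.scale = 1.0 / np.sqrt(self.q.shape[1])
        self.num = None                                 # running numerator, (M, Dv)
        self.den = np.zeros(self.q.shape[0])            # running denominator, (M,)

    def update(self, key, value):
        """Fold one new frame (key: (D,), value: (Dv,)) into the state."""
        key = np.asarray(key, dtype=np.float64)
        value = np.asarray(value, dtype=np.float64)
        # Kernel weight of the new frame for every query.
        # Note: exp() of raw logits can overflow for large inputs;
        # a production version would track a running maximum.
        w = np.exp(self.q @ key * self.scale)
        if self.num is None:
            self.num = np.zeros((self.q.shape[0], value.shape[0]))
        # Decay the past once, then add the new frame: O(1) in history length.
        self.num = self.gamma * self.num + w[:, None] * value[None, :]
        self.den = self.gamma * self.den + w
        return self.num / self.den[:, None]


def full_attention(queries, keys, values, decay):
    """Reference: recompute the same attention over the whole history."""
    t = keys.shape[0]
    ages = np.arange(t - 1, -1, -1)  # age of each frame relative to the newest
    logits = queries @ keys.T / np.sqrt(queries.shape[1]) - decay * ages
    w = np.exp(logits)
    return (w @ values) / w.sum(axis=1, keepdims=True)
```

Calling update once per incoming frame keeps only two running sums per query, so memory and compute per frame do not grow with the length of the video. The full_attention function recomputes the same quantity from scratch and is included only to check that the two agree.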

Code & Models

Repositories

zhaoyue-zephyrus/testra
PyTorch · Official

Taxonomy

Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Video Surveillance and Tracking Methods

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Residual Connection · Dense Connections