What Can Simple Arithmetic Operations Do for Temporal Modeling?

Wenhao Wu; Yuxin Song; Zhun Sun; Jingdong Wang; Chang Xu; Wanli Ouyang

arXiv:2307.08908·cs.CV·August 23, 2023

TL;DR

This paper introduces a simple, plug-and-play Arithmetic Temporal Module (ATM) that uses basic arithmetic operations on frame features to enhance temporal modeling in videos, achieving high accuracy with low computational cost.

Contribution

The work proposes a novel ATM that leverages simple arithmetic operations for effective temporal modeling, compatible with CNNs and ViTs, and demonstrates superior performance on video benchmarks.

Findings

01. ATM improves temporal modeling with low computational cost.

02. Achieves state-of-the-art accuracy on video benchmarks.

03. Compatible with various neural network architectures.

Abstract

Temporal modeling plays a crucial role in understanding video content. To tackle this problem, previous studies have built complicated temporal relations across the time sequence, aided by the development of computationally powerful devices. In this work, we explore the potential of four simple arithmetic operations for temporal modeling. Specifically, we first capture auxiliary temporal cues by computing addition, subtraction, multiplication, and division between pairs of extracted frame features. Then, we extract corresponding features from these cues to benefit the original temporal-irrespective domain. We term this simple pipeline the Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone in a plug-and-play style. We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling…
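
The first step described in the abstract, computing the four arithmetic operations between pairs of frame features, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. Several choices are assumptions made only for illustration: pairing adjacent frames (t, t+1), the operand order for subtraction and division, the `eps` guard against division by zero, and the function names `arithmetic_cues` and `fuse_residual`. In particular, `fuse_residual` is a hypothetical stand-in for the learned feature extraction that the abstract mentions but does not specify.

```python
from typing import Callable, Dict, List

Feature = List[float]  # one feature vector per frame


def arithmetic_cues(frames: List[Feature], eps: float = 1e-6) -> Dict[str, List[Feature]]:
    """Element-wise +, -, *, / between each adjacent pair of frame features.

    Assumptions (not stated in the abstract): pairs are adjacent frames,
    subtraction and division use (next, prev) order, and eps keeps the
    divisor away from zero while preserving its sign.
    """
    ops: Dict[str, Callable[[float, float], float]] = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: b - a,
        "mul": lambda a, b: a * b,
        "div": lambda a, b: b / (a + eps if a >= 0 else a - eps),
    }
    cues: Dict[str, List[Feature]] = {name: [] for name in ops}
    for prev, nxt in zip(frames, frames[1:]):
        for name, op in ops.items():
            cues[name].append([op(a, b) for a, b in zip(prev, nxt)])
    return cues


def fuse_residual(frames: List[Feature], cue: List[Feature], weight: float = 0.1) -> List[Feature]:
    """Hypothetical fusion: add a scaled cue to the earlier frame of each pair.

    The last frame has no following frame, so it is returned unchanged.
    A real module would learn this mapping instead of using a fixed weight.
    """
    fused = [list(f) for f in frames]
    for t, c in enumerate(cue):
        fused[t] = [x + weight * y for x, y in zip(frames[t], c)]
    return fused
```

For T frames the sketch yields T-1 cue vectors per operation, each with the same dimensionality as the frame features, so a cue can be added back to the per-frame features without reshaping.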

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Video Analysis and Summarization