DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning

Wenhao Wu; Yuxiang Zhao; Yanwu Xu; Xiao Tan; Dongliang He; Zhikang Zou; Jin Ye; Yingying Li; Mingde Yao; Zichao Dong; Yifeng Shi

arXiv:2105.12085 · cs.CV · August 18, 2021

TL;DR

DSANet introduces a dynamic segment aggregation module that enhances long-range temporal modeling in video recognition, improving accuracy with minimal additional computational cost.

Contribution

The paper proposes a novel DSA module for adaptive long-range temporal feature aggregation, compatible with existing clip-based models, and demonstrates its effectiveness across multiple benchmarks.

Findings

01

DSA improves I3D ResNet-50 top-1 accuracy from 74.9% to 78.2% on Kinetics-400.

02

DSA plugs into various off-the-shelf clip-based models (e.g., TSM, I3D) and improves their accuracy.

03

Extensive experiments validate the effectiveness of DSANet across benchmarks.

Abstract

Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture relationships among snippets. To be more specific, we attempt to generate a dynamic kernel for a convolutional operation to aggregate long-range temporal information among adjacent snippets adaptively. The DSA module is an efficient plug-and-play module and can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. The…
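
The idea of generating a dynamic kernel and convolving it across snippets can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the official repository for that). It assumes each snippet has already been pooled to a C-dimensional feature vector, that the kernel is produced per channel by two small fully connected layers followed by a softmax, and that a residual connection is added. The function name `dynamic_segment_aggregation`, the weights `w1` and `w2`, and the kernel size are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_segment_aggregation(feats, w1, w2, kernel_size=3):
    """Aggregate snippet features with an input-dependent temporal kernel.

    feats: (U, C) array, one pooled feature vector per snippet (assumed layout).
    w1:    (U, H) weights of a hypothetical first FC layer.
    w2:    (H, K) weights of a hypothetical second FC layer, K == kernel_size.
    Returns a (U, C) array of aggregated snippet features.
    """
    U, C = feats.shape
    assert w2.shape[1] == kernel_size

    # 1) Generate the kernel from the input itself: each channel looks at its
    #    own trajectory over the U snippets and emits K mixing weights.
    hidden = np.maximum(feats.T @ w1, 0.0)      # (C, H), ReLU
    kernel = softmax(hidden @ w2, axis=-1)      # (C, K), rows sum to 1

    # 2) Apply the kernel as a per-channel 1D convolution over the snippet
    #    axis, zero-padding so the output keeps U snippets.
    pad = kernel_size // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)), mode="constant")
    out = np.zeros_like(feats, dtype=float)
    for k in range(kernel_size):
        out += padded[k:k + U] * kernel[:, k]   # (U, C) * (C,) broadcast

    # 3) Residual connection (an assumption of this sketch).
    return out + feats


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U, C, H, K = 8, 16, 4, 3                    # illustrative sizes only
    feats = rng.standard_normal((U, C))
    w1 = rng.standard_normal((U, H)) * 0.1
    w2 = rng.standard_normal((H, K)) * 0.1
    print(dynamic_segment_aggregation(feats, w1, w2, kernel_size=K).shape)
```

Because the kernel is computed from the features rather than stored as a fixed parameter, two videos with different temporal structure mix their neighboring snippets differently, in contrast to plain averaging of snippet-level predictions.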

Code & Models

Repositories

whwu95/DSANet (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization