DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning
Wenhao Wu, Yuxiang Zhao, Yanwu Xu, Xiao Tan, Dongliang He, Zhikang Zou, Jin Ye, Yingying Li, Mingde Yao, Zichao Dong, Yifeng Shi

TL;DR
DSANet introduces a dynamic segment aggregation module that enhances long-range temporal modeling in video recognition, improving accuracy with minimal additional computational cost.
Contribution
The paper proposes a novel DSA module for adaptive long-range temporal feature aggregation, compatible with existing clip-based models, and demonstrates its effectiveness across multiple benchmarks.
Findings
DSA improves I3D ResNet-50 top-1 accuracy on Kinetics-400 from 74.9% to 78.2% (+3.3 points).
DSA yields accuracy gains when plugged into different clip-based models, such as TSM and I3D.
Extensive experiments validate the effectiveness of DSANet across benchmarks.
Abstract
Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most state-of-the-art methods focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how the video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture the relationships among snippets. To be more specific, we attempt to generate a dynamic kernel for a convolutional operation to adaptively aggregate long-range temporal information among adjacent snippets. The DSA module is an efficient plug-and-play module and can be combined with off-the-shelf clip-based models (e.g., TSM, I3D) to perform powerful long-range modeling with minimal overhead. The…
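The aggregation idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `dynamic_segment_aggregation`, the single projection matrix `w_kernel`, the softmax normalization, and the zero padding are all assumptions made for illustration. The sketch only shows the general pattern of predicting a per-channel temporal kernel from the input video and then convolving with it along the snippet axis.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dynamic_segment_aggregation(snippets, w_kernel):
    """Aggregate snippet features with an input-dependent temporal kernel.

    snippets: (U, C) array, one C-dim feature per snippet (hypothetical layout).
    w_kernel: (C, C * K) array, a hypothetical projection that predicts
              K kernel taps per channel; K is assumed to be odd.
    Returns the aggregated (U, C) features and the (C, K) kernel.
    """
    U, C = snippets.shape
    K = w_kernel.shape[1] // C

    # Video-level context: average the features over all snippets.
    context = snippets.mean(axis=0)                                # (C,)

    # Dynamic kernel: depends on this video, normalized over the K taps.
    kernel = softmax((context @ w_kernel).reshape(C, K), axis=1)   # (C, K)

    # Depthwise 1D convolution along the snippet axis with zero padding.
    pad = K // 2
    padded = np.pad(snippets, ((pad, pad), (0, 0)))                # (U + 2*pad, C)
    out = np.zeros_like(snippets)
    for k in range(K):
        out += padded[k:k + U] * kernel[:, k]
    return out, kernel


rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16))          # 8 snippets, 16 channels
w = rng.standard_normal((16, 16 * 3)) * 0.1   # predicts K = 3 taps per channel
out, kernel = dynamic_segment_aggregation(feats, w)
print(out.shape, kernel.shape)                # (8, 16) (16, 3)
```

Because the kernel is recomputed for every video, each snippet's output mixes in its neighbors with weights that adapt to the content, in contrast to the fixed averaging of snippet-level predictions that the abstract criticizes.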
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Video Analysis and Summarization
