Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, and Qijun Chen

TL;DR
This paper introduces FIMA, a fine-grained motion alignment framework for contrastive video learning that enhances motion feature quality through pixel-level supervision and improves temporal diversity, leading to state-of-the-art results.
Contribution
The paper proposes a novel dense contrastive learning framework with a motion decoder and foreground sampling to achieve precise spatiotemporal motion alignment in video representations.
Findings
FIMA achieves state-of-the-art performance on UCF101, HMDB51, and Diving48 datasets.
The method enhances motion-awareness in video representations.
FIMA outperforms existing methods in downstream tasks.
Abstract
As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and…
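The two ingredients the abstract names, frame difference as a cheap motion signal and contrastive alignment at the level of individual spatiotemporal locations rather than whole clips, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `frame_difference` and `dense_info_nce`, the `temperature` value, and the use of a plain InfoNCE loss are assumptions made for illustration, and the motion decoder and foreground sampling strategy are omitted because the abstract is truncated before describing them.

```python
import numpy as np


def frame_difference(clip):
    """Difference between consecutive frames of a clip.

    clip: float array of shape (T, H, W, C).
    Returns an array of shape (T - 1, H, W, C).
    """
    return clip[1:] - clip[:-1]


def dense_info_nce(rgb_feats, motion_feats, temperature=0.1, eps=1e-8):
    """Location-wise InfoNCE loss between two feature sets (illustrative).

    rgb_feats, motion_feats: arrays of shape (N, D). Row i of each is assumed
    to describe the same spatiotemporal location, so row i of motion_feats is
    the positive for row i of rgb_feats and all other rows are negatives.
    """
    # L2-normalise so the dot product is a cosine similarity.
    q = rgb_feats / (np.linalg.norm(rgb_feats, axis=1, keepdims=True) + eps)
    k = motion_feats / (np.linalg.norm(motion_feats, axis=1, keepdims=True) + eps)

    logits = q @ k.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # The diagonal holds the log-probability of each positive pair.
    return float(-np.mean(np.diag(log_prob)))
```

In this sketch, an instance-level variant would pool each clip to a single vector before computing the loss, while the dense variant keeps one row per location, which is what allows the loss to penalise spatial or temporal misalignment between the two feature sets.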
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications
Methods: Dense Contrastive Learning · Contrastive Learning · ALIGN
