Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning

Minghao Zhu, Xiao Lin, Ronghao Dang, Chengju Liu, and Qijun Chen

arXiv:2309.00297·cs.CV·October 16, 2024

TL;DR

This paper introduces FIMA, a fine-grained motion alignment framework for contrastive video representation learning. It improves motion feature quality through pixel-level supervision and increases temporal diversity, reaching state-of-the-art results on UCF101, HMDB51, and Diving48.

Contribution

The paper proposes a novel dense contrastive learning framework with a motion decoder and foreground sampling to achieve precise spatiotemporal motion alignment in video representations.
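To make the idea of dense (position-wise) contrastive alignment concrete, here is a minimal NumPy sketch. It is a generic InfoNCE loss applied per location, not the paper's exact objective: the function name `dense_info_nce`, the `temperature` value, and the flattening of spatiotemporal positions into rows are all assumptions made for illustration.

```python
import numpy as np


def dense_info_nce(rgb_feats, motion_feats, temperature=0.1):
    """Position-wise InfoNCE between two (N, C) feature sets.

    Hypothetical helper: row i of `rgb_feats` and row i of `motion_feats`
    are assumed to come from the same spatiotemporal location and form the
    positive pair; every other row serves as a negative.
    """
    rgb_feats = np.asarray(rgb_feats, dtype=np.float64)
    motion_feats = np.asarray(motion_feats, dtype=np.float64)

    # L2-normalise so the dot product is a cosine similarity.
    q = rgb_feats / np.linalg.norm(rgb_feats, axis=1, keepdims=True)
    k = motion_feats / np.linalg.norm(motion_feats, axis=1, keepdims=True)

    # (N, N) similarity matrix; the diagonal holds the positive pairs.
    logits = q @ k.T / temperature

    # Numerically stable log-softmax over each row.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Mean negative log-likelihood of the positives.
    return float(-np.mean(np.diag(log_prob)))
```

When the two feature sets agree location by location the loss is close to zero; misaligning them (for example, shifting one set by one position) raises it, which is the signal that pixel-level supervision relies on.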

Findings

01. FIMA achieves state-of-the-art performance on UCF101, HMDB51, and Diving48 datasets.

02. The method enhances motion-awareness in video representations.

03. FIMA outperforms existing methods in downstream tasks.

Abstract

As the most essential property in a video, motion information is critical to a robust and generalized video representation. To inject motion dynamics, recent works have adopted frame difference as the source of motion information in video contrastive learning, considering the trade-off between quality and cost. However, existing works align motion features at the instance level, which suffers from spatial and temporal weak alignment across modalities. In this paper, we present a Fine-grained Motion Alignment (FIMA) framework, capable of introducing well-aligned and significant motion information. Specifically, we first develop a dense contrastive learning framework in the spatiotemporal domain to generate pixel-level motion supervision. Then, we design a motion decoder and a foreground sampling strategy to eliminate the weak alignments in terms of time and…
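The abstract names frame difference as the motion source and mentions a foreground sampling strategy without spelling either out in this excerpt. The sketch below shows one plausible reading under stated assumptions: frame difference is the signed difference between consecutive grayscale frames, and foreground positions are picked as the locations with the largest absolute difference. Both helper names and the top-k selection rule are hypothetical and may differ from what the authors implement.

```python
import numpy as np


def frame_difference(clip):
    """Signed difference between consecutive frames.

    `clip` is assumed to be a (T, H, W) grayscale array; the result has
    shape (T - 1, H, W).
    """
    clip = np.asarray(clip, dtype=np.float32)
    return clip[1:] - clip[:-1]


def sample_foreground(diff, num_samples):
    """Return (t, y, x) indices of the strongest-motion positions.

    Assumption: "foreground" means the `num_samples` locations with the
    largest absolute frame difference. Ties are broken arbitrarily.
    """
    magnitude = np.abs(diff)
    # Flat indices sorted by descending magnitude, truncated to the top k.
    flat = np.argsort(magnitude, axis=None)[::-1][:num_samples]
    # Convert flat indices back to (t, y, x) coordinates, one row each.
    return np.stack(np.unravel_index(flat, diff.shape), axis=1)
```

For a clip in which a single pixel flashes on for one frame, `frame_difference` is non-zero only at that pixel (once positive, once negative), and `sample_foreground` with `num_samples=2` returns exactly those two positions while ignoring the static background.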

Code & Models

Repositories

zmhh-h/fima (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Multimodal Machine Learning Applications

Methods: Dense Contrastive Learning · Contrastive Learning · ALIGN