Supervised Contrastive Frame Aggregation for Video Representation Learning

Shaif Chowdhury; Mushfika Rahman; Greg Hamerly

arXiv:2512.12549·cs.CV·December 16, 2025

TL;DR

This paper introduces a supervised contrastive learning framework for video representation that uses a novel frame aggregation strategy to leverage global context, improve accuracy, and reduce computational costs.

Contribution

It presents a new video-to-image aggregation method combined with contrastive learning, enabling effective video representations with pre-trained CNNs and less computational overhead.
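The video-to-image aggregation step can be illustrated with a small sketch. This is not the authors' implementation: the segment-based sampling rule, the row-major grid layout, and the function names `sample_frame_indices` and `tile_frames` are assumptions chosen for illustration. Frames are plain nested lists (a list of pixel rows) so the example runs with the standard library alone.

```python
import random


def sample_frame_indices(num_frames, k, rng):
    """Pick k frame indices that span the whole video.

    Assumed sampling rule (not from the paper): split the video into k
    equal temporal segments and draw one random index from each segment.
    """
    if not 1 <= k <= num_frames:
        raise ValueError("k must be between 1 and num_frames")
    bounds = [i * num_frames // k for i in range(k + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(k)]


def tile_frames(frames, rows, cols):
    """Arrange rows*cols equally sized frames into one image, row-major.

    Each frame is a list of pixel rows; the result is a single image with
    rows*height pixel rows and cols*width pixels per row.
    """
    if len(frames) != rows * cols:
        raise ValueError("expected exactly rows * cols frames")
    height = len(frames[0])
    image = []
    for r in range(rows):
        for y in range(height):
            line = []
            for c in range(cols):
                line.extend(frames[r * cols + c][y])
            image.append(line)
    return image


# Toy video: 12 frames of size 2x2, every pixel set to the frame index.
video = [[[t] * 2 for _ in range(2)] for t in range(12)]

# Two different samplings of the same video give two distinct "views".
view_a = tile_frames([video[i] for i in sample_frame_indices(12, 4, random.Random(0))], 2, 2)
view_b = tile_frames([video[i] for i in sample_frame_indices(12, 4, random.Random(1))], 2, 2)
```

Because the tiled result is an ordinary 2D image, it can in principle be passed to an image backbone without any video-specific layers, which is the property the contribution statement relies on.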

Findings

01 Outperforms existing methods in classification accuracy

02 Requires fewer computational resources

03 Achieves 76% accuracy on Penn Action and 48% on HMDB51

Abstract

We propose a supervised contrastive learning framework for video representation learning that leverages temporally global context. We introduce a video-to-image aggregation strategy that spatially arranges multiple frames from each video into a single input image. This design enables the use of pre-trained convolutional neural network backbones such as ResNet50 and avoids the computational overhead of complex video transformer models. We then design a contrastive learning objective that directly compares pairwise projections generated by the model. Positive pairs are defined as projections from videos sharing the same label, while all other projections are treated as negatives. Multiple natural views of the same video are created using different temporal frame samplings. Rather than relying on data augmentation, these frame-level variations produce diverse…
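The pairing rule in the abstract (same label means positive, everything else is a negative) can be sketched as a loss over a batch of projections. The abstract does not give the exact formula, so the sketch below assumes the widely used supervised contrastive (SupCon-style) form with cosine similarity and a temperature; the function names and the default `temperature=0.1` are hypothetical choices, not values from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def supervised_contrastive_loss(projections, labels, temperature=0.1):
    """SupCon-style loss (assumed form, not confirmed by the source).

    For each anchor i, every other projection with the same label is a
    positive; all remaining projections in the batch act as negatives.
    Anchors without any positive are skipped.
    """
    z = [l2_normalize(p) for p in projections]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue
        # Scaled similarity of the anchor to every other projection.
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Log-sum-exp over all non-anchor projections, computed stably.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


# Four toy projections forming two clusters in 2D.
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
loss_matching = supervised_contrastive_loss(batch, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(batch, [0, 1, 0, 1])
```

In this toy batch the labels that agree with the geometric clusters yield the lower loss, which is the behavior the objective is meant to reward. In a training setup, the projections would come from the backbone applied to different temporal samplings of each video.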

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis