VideoCompressa: Data-Efficient Video Understanding via Joint Temporal Compression and Spatial Reconstruction

Shaobo Wang; Tianle Niu; Runkang Yang; Deshan Liu; Xu He; Zichen Wen; Conghui He; Xuming Hu; Linfeng Zhang

arXiv:2511.18831·cs.CV·November 25, 2025

Open Access

TL;DR

VideoCompressa introduces a data-efficient video understanding framework that leverages joint temporal compression and spatial reconstruction, significantly reducing data requirements while maintaining high performance.

Contribution

It presents a novel joint optimization approach for keyframe selection and latent compression, improving video data synthesis efficiency and effectiveness.

Findings

01. Surpasses full-data training accuracy with only 0.13% of data on UCF101.

02. Achieves comparable performance to full-data fine-tuning with 0.41% of data on HMDB51.

03. Achieves over 5800x speedup compared to traditional synthesis methods.

Abstract

The scalability of video understanding models is increasingly limited by the prohibitive storage and computational costs of large-scale video datasets. While data synthesis has improved data efficiency in the image domain, its extension to video remains challenging due to pervasive temporal redundancy and complex spatiotemporal dynamics. In this work, we uncover a critical insight: the primary source of inefficiency in video datasets is not inter-sample redundancy, but intra-sample frame-level redundancy. To leverage this insight, we introduce VideoCompressa, a novel framework for video data synthesis that reframes the problem as dynamic latent compression. Specifically, VideoCompressa jointly optimizes a differentiable keyframe selector (implemented as a lightweight ConvNet with Gumbel-Softmax sampling) to identify the most informative frames, and a pretrained, frozen Variational…
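The abstract mentions Gumbel-Softmax sampling for keyframe selection. The sketch below is only an illustration of that sampling step, not the authors' implementation: it assumes per-frame scores are already available (in the paper they would come from the lightweight ConvNet), and the names `gumbel_softmax`, `select_keyframe`, `frame_scores`, and `tau` are hypothetical. It runs the forward pass in plain Python; in an autodiff framework the same relaxation is what lets gradients reach the scores.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return a relaxed one-hot sample (a probability vector) over the logits."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def select_keyframe(frame_scores, tau=0.5, rng=random):
    """Sample soft selection weights and return (hard index, soft weights)."""
    weights = gumbel_softmax(frame_scores, tau=tau, rng=rng)
    index = max(range(len(weights)), key=weights.__getitem__)
    return index, weights


# Hypothetical scores for a 4-frame clip; higher means "more informative".
rng = random.Random(0)
frame_scores = [0.1, 2.0, 0.3, 1.5]
index, weights = select_keyframe(frame_scores, tau=0.5, rng=rng)
print(index, [round(w, 3) for w in weights])
```

A lower `tau` pushes the weights toward a one-hot vector, while a higher `tau` spreads them more evenly across frames. The hard index is distributed according to the softmax of the scores regardless of `tau`.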

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Pose and Action Recognition