TL;DR
This paper investigates a lightweight approach to video pre-training by freezing an image foundation model and training only a temporal module, aiming to reduce data and compute costs while maintaining strong performance.
Contribution
It explores a training paradigm that reuses a pre-trained image foundation model as a frozen spatial encoder and trains only a recurrent temporal module for video understanding.
Findings
Strong temporal performance achieved without large-scale video pre-training
Reusing image foundation models reduces data and compute requirements
Empirical results across multiple tasks support the approach's feasibility
Abstract
Video foundation models achieve strong performance across many video understanding tasks, but typically require large-scale pre-training on massive video datasets, resulting in substantial data and compute costs. In contrast, modern image foundation models already provide powerful spatial representations. This raises an important question: can competitive video models be built by reusing these spatial representations and pre-training only for temporal reasoning? We take initial steps toward exploring a lightweight training paradigm that freezes a pre-trained image foundation model and trains only a recurrent temporal module to process streaming video. By reusing an image foundation model as a spatial encoder, this approach could significantly reduce the amount of video data and compute required compared to end-to-end video pre-training. In this work, we explore the feasibility of this…
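The paradigm described in the abstract (a frozen image model as spatial encoder, with a recurrent module carrying state across streaming frames) can be illustrated with a small, framework-free sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class names `FrozenEncoder` and `RecurrentTemporalModule`, the random linear projection standing in for a real image foundation model, the Elman-style update rule, and all dimensions. The sketch covers only the forward pass and the frozen/trainable split; the training loop is omitted.

```python
import math
import random


def _random_matrix(rows, cols, scale, rng):
    """Build a rows x cols matrix of uniform random weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class FrozenEncoder:
    """Hypothetical stand-in for a pre-trained image foundation model.

    Its weights are fixed at construction and flagged as not trainable.
    """

    trainable = False

    def __init__(self, in_dim, feat_dim, seed=0):
        self.weights = _random_matrix(feat_dim, in_dim, 1.0, random.Random(seed))

    def parameters(self):
        return [self.weights]

    def __call__(self, frame):
        # Map one flattened frame to a per-frame spatial feature vector.
        return [math.tanh(sum(w * x for w, x in zip(row, frame)))
                for row in self.weights]


class RecurrentTemporalModule:
    """Assumed Elman-style cell: h_t = tanh(W_x f_t + W_h h_{t-1})."""

    trainable = True

    def __init__(self, feat_dim, hidden_dim, seed=1):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        self.w_x = _random_matrix(hidden_dim, feat_dim, 0.5, rng)
        self.w_h = _random_matrix(hidden_dim, hidden_dim, 0.5, rng)

    def parameters(self):
        return [self.w_x, self.w_h]

    def initial_state(self):
        return [0.0] * self.hidden_dim

    def step(self, feature, state):
        # Combine the current frame feature with the previous hidden state.
        return [
            math.tanh(sum(a * f for a, f in zip(row_x, feature))
                      + sum(b * h for b, h in zip(row_h, state)))
            for row_x, row_h in zip(self.w_x, self.w_h)
        ]


def stream_video(frames, encoder, temporal):
    """Process frames one at a time, yielding the hidden state after each."""
    state = temporal.initial_state()
    for frame in frames:
        feature = encoder(frame)          # frozen spatial encoding
        state = temporal.step(feature, state)  # temporal update
        yield state


def trainable_parameter_count(modules):
    """Count scalar weights belonging to modules flagged as trainable."""
    return sum(
        len(row)
        for module in modules if module.trainable
        for matrix in module.parameters()
        for row in matrix
    )
```

Because `stream_video` is a generator that keeps only the current hidden state, frames can be consumed as they arrive rather than as a fixed-length clip. In this toy setup an optimizer would update only the weights counted by `trainable_parameter_count`, which is the source of the compute savings the abstract anticipates.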
