TL;DR
This paper introduces Co-Settle, a lightweight transfer learning framework that balances intra-video temporal consistency and inter-video semantic separability for improved video representation learning from image models.
Contribution
It proposes a novel lightweight projection layer with a cycle consistency and separability constraint, enabling effective self-supervised transfer from images to videos.
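The summary above does not spell out the loss, so the following is only a minimal NumPy sketch of how a projection over frozen features could be scored on both objectives. It is not the authors' implementation. The linear map `W`, the soft round-trip form of the cycle-consistency term, the mean-cosine separability penalty, the temperature, and the weight `lam` are all illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def project(frozen_feats, W):
    # Hypothetical lightweight projection applied on top of frozen encoder features.
    return l2_normalize(frozen_feats @ W)

def cycle_consistency_loss(z_a, z_b, temperature=0.1):
    # Assumed form: each patch in frame A soft-matches into frame B and back;
    # the loss is the negative log-probability of returning to its start.
    a_to_b = softmax(z_a @ z_b.T / temperature)
    b_to_a = softmax(z_b @ z_a.T / temperature)
    round_trip = a_to_b @ b_to_a  # rows are probability distributions
    return float(-np.mean(np.log(np.diag(round_trip) + 1e-8)))

def separability_penalty(video_embs):
    # Assumed form: mean off-diagonal cosine similarity between unit-norm
    # video-level embeddings; lower means videos are easier to tell apart.
    sim = video_embs @ video_embs.T
    m = sim.shape[0]
    return float(sim[~np.eye(m, dtype=bool)].mean())

rng = np.random.default_rng(0)
N, D, M = 16, 32, 4                      # patches per frame, feature dim, videos
W = np.eye(D)                            # identity init stands in for a trained projection
frame_a = rng.normal(size=(N, D))        # stand-in for frozen encoder output
frame_b = frame_a + 0.05 * rng.normal(size=(N, D))  # nearby frame, same video
z_a, z_b = project(frame_a, W), project(frame_b, W)
video_embs = project(rng.normal(size=(M, D)), W)

lam = 1.0                                # hypothetical trade-off weight
total = cycle_consistency_loss(z_a, z_b) + lam * separability_penalty(video_embs)
```

In this sketch only `W` would be trained, which is what keeps the transfer lightweight; `lam` controls how strongly separability is weighted against temporal consistency.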
Findings
Consistent improvements across multiple video tasks.
Achieves effective transfer with only five epochs of training.
Theoretical support for the trade-off optimization.
Abstract
Recent studies have made notable progress in video representation learning by transferring image-pretrained models to video tasks, typically with complex temporal modules and video fine-tuning. However, fine-tuning heavy modules may compromise inter-video semantic separability, i.e., the essential ability to distinguish objects across videos. Conversely, reducing the number of tunable parameters hinders intra-video temporal consistency, which is required for stable representations of the same object within a video. This dilemma indicates a potential trade-off between intra-video temporal consistency and inter-video semantic separability during image-to-video transfer. To this end, we propose the Consistency-Separability Trade-off Transfer Learning (Co-Settle) framework, which applies a lightweight projection layer on top of the frozen image-pretrained encoder to adjust representation space…
