Rethinking Image-to-Video Adaptation: An Object-centric Perspective
Rui Qian, Shuangrui Ding, Dahua Lin

TL;DR
This paper introduces an object-centric approach to image-to-video adaptation, leveraging object discovery and interaction modeling to improve efficiency and interpretability in video understanding tasks.
Contribution
It proposes a novel object-centric adaptation strategy using slot attention and object-level losses, achieving state-of-the-art results with significantly fewer parameters.
Findings
Achieves state-of-the-art performance on action recognition benchmarks.
Uses only 5% of the tunable parameters of fully finetuned models.
Performs well in zero-shot video object segmentation without retraining.
Abstract
Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top of the spatial module. However, these attempts are subject to limitations in efficiency and interpretability. In this paper, we propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective. Inspired by human perception, which identifies objects as key components for video understanding, we integrate a proxy task of object discovery into image-to-video transfer learning. Specifically, we adopt slot attention with learnable queries to distill each frame into a compact set of object tokens. These object-centric tokens are then processed through object-time interaction layers to model object state changes across…
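To make the slot-attention step in the abstract concrete, here is a minimal NumPy sketch of how a set of query vectors can compete for the patch tokens of one frame and condense them into a few object tokens. It is an illustration under stated assumptions, not the authors' implementation: the function name `slot_attention`, the shapes (196 patches, 8 slots, 64 dimensions), and the iteration count are all hypothetical, and the learned projections, layer normalization, and recurrent slot update used in standard slot attention are omitted for brevity. The queries are random here, whereas the paper describes them as learnable.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(patch_tokens, slot_queries, num_iters=3, eps=1e-8):
    """Simplified slot attention (assumption: no learned projections or GRU).

    patch_tokens: (N, D) features of one frame.
    slot_queries: (K, D) initial slot vectors (learnable in a real model).
    Returns the (K, D) object tokens and the (K, N) attention map.
    """
    d = patch_tokens.shape[-1]
    slots = slot_queries.copy()
    for _ in range(num_iters):
        logits = slots @ patch_tokens.T / np.sqrt(d)  # (K, N) similarity
        # Softmax over the slot axis: slots compete for each patch.
        attn = softmax(logits, axis=0)
        # Normalize per slot so each update is a weighted mean of patches.
        weights = attn / (attn.sum(axis=1, keepdims=True) + eps)
        slots = weights @ patch_tokens  # (K, D)
    return slots, attn

rng = np.random.default_rng(0)
frame_tokens = rng.normal(size=(196, 64))  # hypothetical 14x14 patch grid
queries = rng.normal(size=(8, 64))         # hypothetical 8 slots
object_tokens, attn = slot_attention(frame_tokens, queries)
```

The key design choice is the softmax over slots rather than over patches: each patch distributes its attention across the slots, so the slots are pushed to explain different regions of the frame. The resulting attention map is the kind of per-slot spatial assignment that can be read as a segmentation mask.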
Taxonomy
Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Cinema and Media Studies
Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training
