Rethinking Image-to-Video Adaptation: An Object-centric Perspective

Rui Qian; Shuangrui Ding; Dahua Lin

arXiv:2407.06871·cs.CV·July 10, 2024

TL;DR

This paper introduces an object-centric approach to image-to-video adaptation, leveraging object discovery and interaction modeling to improve efficiency and interpretability in video understanding tasks.

Contribution

The paper proposes an object-centric adaptation strategy that uses slot attention and object-level losses, achieving state-of-the-art results with significantly fewer parameters.

Findings

01. Achieves state-of-the-art performance on action recognition benchmarks.

02. Operates with only 5% of the parameters of fully finetuned models.

03. Performs well in zero-shot video object segmentation without retraining.

Abstract

Image-to-video adaptation seeks to efficiently adapt image models for use in the video domain. Instead of finetuning the entire image backbone, many image-to-video adaptation paradigms use lightweight adapters for temporal modeling on top of the spatial module. However, these attempts are subject to limitations in efficiency and interpretability. In this paper, we propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective. Inspired by human perception, which identifies objects as key components for video understanding, we integrate a proxy task of object discovery into image-to-video transfer learning. Specifically, we adopt slot attention with learnable queries to distill each frame into a compact set of object tokens. These object-centric tokens are then processed through object-time interaction layers to model object state changes across…
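
The slot-attention step described in the abstract (learnable queries distilling a frame's patch features into a few object tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The sizes (196 patches, 8 slots, 64 channels), the random projection matrices, and the function name `slot_attention` are assumptions chosen for illustration, and the GRU and MLP slot update of the original slot-attention formulation is replaced here by a plain weighted mean.

```python
import numpy as np

def softmax(x, axis):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(features, queries, w_q, w_k, w_v, num_iters=3, eps=1e-8):
    """Simplified slot attention for one frame (illustrative sketch only).

    features: (N, D) patch tokens from a frozen image backbone (assumed).
    queries:  (K, D) learnable slot queries (random here, trained in practice).
    Returns object tokens of shape (K, D) and attention maps of shape (N, K).
    """
    d = queries.shape[-1]
    k = features @ w_k
    v = features @ w_v
    slots = queries
    for _ in range(num_iters):
        q = slots @ w_q
        logits = k @ q.T / np.sqrt(d)      # (N, K) patch-to-slot affinities
        attn = softmax(logits, axis=1)     # slots compete for each patch
        # Normalise over patches so each slot takes a weighted mean of values.
        weights = attn / (attn.sum(axis=0, keepdims=True) + eps)
        slots = weights.T @ v              # (K, D); GRU/MLP update omitted
    return slots, attn

# Hypothetical sizes: 14x14 patches, 8 slots, 64 channels.
rng = np.random.default_rng(0)
N, K, D = 196, 8, 64
features = rng.normal(size=(N, D))
queries = rng.normal(size=(K, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

object_tokens, attn = slot_attention(features, queries, w_q, w_k, w_v)
print(object_tokens.shape, attn.shape)
```

Because the softmax runs over the slot axis, each patch distributes its attention across the slots, which is what lets the per-slot attention maps be read as soft object masks.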

Taxonomy

Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Cinema and Media Studies

Methods: Softmax · Attention Is All You Need · Sparse Evolutionary Training