$R^2$-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

Ye Liu; Jixuan He; Wanhua Li; Junsik Kim; Donglai Wei; Hanspeter Pfister; Chang Wen Chen

arXiv:2404.00801·cs.CV·July 23, 2024·2 cites

TL;DR

This paper introduces R^2-Tuning, a lightweight transfer learning framework that uses features from multiple CLIP layers for efficient video temporal grounding, reaching state-of-the-art results without an additional temporal backbone.

Contribution

The paper proposes a lightweight R^2 Block that performs progressive spatial-temporal modeling over CLIP layers, starting from the last layer, while containing only 1.5% of the total parameters.

Findings

01. State-of-the-art results on six benchmarks
02. Effective without additional temporal backbones
03. Parameter-efficient: the learnable R^2 Block contains only 1.5% of the total parameters

Abstract

Video temporal grounding (VTG) is a fine-grained video understanding problem that aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models are built upon frame-wise final-layer CLIP features, aided by additional temporal backbones (e.g., SlowFast) with sophisticated temporal reasoning mechanisms. In this work, we claim that CLIP itself already shows great potential for fine-grained spatial-temporal modeling, as each layer offers distinct yet useful information under different granularity levels. Motivated by this, we propose Reversed Recurrent Tuning ($R^{2}$-Tuning), a parameter- and memory-efficient transfer learning framework for video temporal grounding. Our method learns a lightweight $R^{2}$ Block containing only 1.5% of the total parameters to perform progressive spatial-temporal modeling. Starting from the last layer of CLIP, $R^{2}$ …
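The abstract is cut off in this listing, so the actual design of the $R^{2}$ Block is not available here. As a rough illustration of the general idea only (one shared, lightweight update applied recurrently while walking from the last layer back toward earlier ones), a toy sketch in plain Python might look like the following. Everything in it is an assumption made for illustration: the names `temporal_smooth` and `reversed_recurrent_refine`, the fixed `gate` value, and the neighbour-averaging step are hypothetical stand-ins, not the authors' learned modules, and no real CLIP features are involved.

```python
# Toy sketch only: NOT the authors' R^2 Block. All names and the update
# rule are hypothetical and chosen to keep the example self-contained.
from typing import List

Vector = List[float]        # one feature vector of dimension D
FrameFeats = List[Vector]   # T frames, one vector per frame


def temporal_smooth(frames: FrameFeats) -> FrameFeats:
    """Average each frame with its immediate neighbours (toy temporal mixing)."""
    num_frames = len(frames)
    smoothed = []
    for t in range(num_frames):
        window = frames[max(0, t - 1): min(num_frames, t + 2)]
        smoothed.append([sum(column) / len(window) for column in zip(*window)])
    return smoothed


def reversed_recurrent_refine(layer_feats: List[FrameFeats],
                              gate: float = 0.5) -> FrameFeats:
    """Refine a hidden state by visiting layers from last to first.

    layer_feats[k] holds frame-wise features from layer k (assumed to be
    precomputed by a frozen encoder). The same update rule is reused at
    every step, which is what makes the procedure recurrent.
    """
    # Start from the last layer's features.
    hidden = temporal_smooth(layer_feats[-1])
    # Walk backwards through the remaining layers.
    for feats in reversed(layer_feats[:-1]):
        mixed = [
            [gate * h + (1.0 - gate) * f for h, f in zip(h_vec, f_vec)]
            for h_vec, f_vec in zip(hidden, feats)
        ]
        hidden = temporal_smooth(mixed)
    return hidden
```

In this sketch the only per-step "parameter" is the scalar `gate`; in a learned version the shared update would carry trainable weights while the encoder producing `layer_feats` stays frozen, which is one way a small trainable fraction such as the 1.5% mentioned above could arise.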

Code & Models

Repositories

yeliudev/R2-Tuning (PyTorch, official)

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Advanced Image Processing Techniques

Methods: Contrastive Language-Image Pre-training