Sparse-Dense Side-Tuner for efficient Video Temporal Grounding

David Pujol-Perich; Sergio Escalera; Albert Clapés

arXiv:2507.07744·cs.CV·July 11, 2025

Open Access

TL;DR

This paper introduces the Sparse-Dense Side-Tuner, an anchor-free, parameter-efficient framework for Video Temporal Grounding that leverages a novel attention mechanism and a new backbone, achieving state-of-the-art results with fewer parameters.

Contribution

It proposes the first anchor-free, parameter-efficient side-tuning architecture for VTG, incorporating a novel deformable self-attention mechanism and integrating the InternVideo2 backbone.

Findings

01 Achieves state-of-the-art results on multiple datasets.

02 Reduces the parameter count by up to 73% compared to previous methods.

03 Demonstrates the effectiveness of the proposed attention mechanism.

Abstract

Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning, and side-tuning (ST) in particular, has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention, a key limitation of existing anchor-free methods. Additionally, we present the first…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques