Sparse-Dense Side-Tuner for efficient Video Temporal Grounding
David Pujol-Perich, Sergio Escalera, Albert Clapés

TL;DR
This paper introduces the Sparse-Dense Side-Tuner, an anchor-free, parameter-efficient framework for Video Temporal Grounding that leverages a novel attention mechanism and a new backbone, achieving state-of-the-art results with fewer parameters.
Contribution
It proposes the first anchor-free, parameter-efficient side-tuning architecture for VTG, incorporating a novel deformable self-attention mechanism and integrating the InternVideo2 backbone.
Findings
Achieves state-of-the-art results on multiple datasets.
Reduces the parameter count by up to 73% compared to previous methods.
Demonstrates the effectiveness of the proposed attention mechanism.
Abstract
Video Temporal Grounding (VTG) involves Moment Retrieval (MR) and Highlight Detection (HD) based on textual queries. For this, most methods rely solely on final-layer features of frozen large pre-trained backbones, limiting their adaptability to new domains. While full fine-tuning is often impractical, parameter-efficient fine-tuning -- and particularly side-tuning (ST) -- has emerged as an effective alternative. However, prior ST work approaches this problem from a frame-level refinement perspective, overlooking the inherently sparse nature of MR. To address this, we propose the Sparse-Dense Side-Tuner (SDST), the first anchor-free ST architecture for VTG. We also introduce the Reference-based Deformable Self-Attention, a novel mechanism that enhances the context modeling of deformable attention, whose weakness is a key limitation of existing anchor-free methods. Additionally, we present the first…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
