A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
Allen He, Qi Liu, Kun Liu, Xinchen Liu, Wu Liu

TL;DR
This paper introduces a fully end-to-end training paradigm for temporal sentence grounding in videos, jointly optimizing video backbones and localization heads, with a novel adapter to enhance visual features.
Contribution
It proposes an end-to-end training framework with a Sentence Conditioned Adapter (SCADA) to improve video backbone adaptation for TSGV tasks.
Findings
End-to-end training outperforms frozen-backbone baselines across different model scales.
SCADA enhances visual representation and enables deeper backbones.
The method surpasses the state of the art on two benchmarks.
Abstract
Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the…
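The abstract is cut off before it explains how SCADA uses sentence features, so the following is only a minimal sketch of one plausible reading, not the authors' design. It assumes a residual bottleneck adapter whose hidden activations are scaled and shifted by projections of the sentence feature (FiLM-style modulation). The class name, the dimensions, the zero-initialized up-projection, and the modulation scheme are all assumptions made for illustration. It is written with plain Python lists, without a deep learning library, to keep the data flow visible.

```python
import random


def init_linear(rng, n_in, n_out, zero=False):
    """Return (weights, bias) for a dense layer; weights are n_in x n_out."""
    w = [[0.0 if zero else rng.gauss(0.0, 0.02) for _ in range(n_out)]
         for _ in range(n_in)]
    return w, [0.0] * n_out


def linear(x, w, b):
    """Compute x @ w + b for a single feature vector x."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]


class SentenceConditionedAdapter:
    """Hypothetical adapter: only these small layers would be trained,
    while the surrounding video backbone weights stay untouched."""

    def __init__(self, dim, sent_dim, bottleneck, seed=0):
        rng = random.Random(seed)
        self.down = init_linear(rng, dim, bottleneck)
        # Assumption: the sentence feature yields a per-channel scale and shift.
        self.to_scale = init_linear(rng, sent_dim, bottleneck)
        self.to_shift = init_linear(rng, sent_dim, bottleneck)
        # Zero-initialized so the adapter starts out as an identity mapping.
        self.up = init_linear(rng, bottleneck, dim, zero=True)

    def __call__(self, frame_feats, sentence_feat):
        scale = linear(sentence_feat, *self.to_scale)
        shift = linear(sentence_feat, *self.to_shift)
        out = []
        for x in frame_feats:
            h = linear(x, *self.down)
            # Modulate the bottleneck activations with the sentence, then ReLU.
            h = [max(0.0, (1.0 + s) * hi + t)
                 for hi, s, t in zip(h, scale, shift)]
            delta = linear(h, *self.up)
            # Residual connection keeps the backbone feature as the base signal.
            out.append([xi + di for xi, di in zip(x, delta)])
        return out


# Usage: two frame features of width 4, one sentence feature of width 3.
adapter = SentenceConditionedAdapter(dim=4, sent_dim=3, bottleneck=2)
frames = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]
adapted = adapter(frames, [0.1, 0.2, 0.3])
```

Because the up-projection starts at zero, the adapted features initially equal the backbone features, and the sentence influences the output only once the adapter weights are trained. This is a common choice for adapters in general, not something the abstract states about SCADA.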
