A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos

Allen He; Qi Liu; Kun Liu; Xinchen Liu; Wu Liu

arXiv:2604.02860 · cs.CV · April 6, 2026

TL;DR

This paper introduces a fully end-to-end training paradigm for temporal sentence grounding in videos, jointly optimizing video backbones and localization heads, with a novel adapter to enhance visual features.

Contribution

The paper proposes an end-to-end training framework with a Sentence Conditioned Adapter (SCADA) that improves how the video backbone adapts to the TSGV task.

Findings

1. End-to-end training outperforms frozen-backbone baselines across different model scales.

2. SCADA enhances the visual representation and enables training deeper backbones.

3. The method surpasses the state of the art on two benchmarks.

Abstract

Temporal sentence grounding in videos (TSGV) aims to localize a temporal segment that semantically corresponds to a sentence query from an untrimmed video. Most current methods adopt pre-trained query-agnostic visual encoders for offline feature extraction, and the video backbones are frozen and not optimized for TSGV. This leads to a task discrepancy issue for the video backbone trained for visual classification, but utilized for TSGV. To bridge this gap, we propose a fully end-to-end paradigm that jointly optimizes the video backbone and localization head. We first conduct an empirical study validating the effectiveness of end-to-end learning over frozen baselines across different model scales. Furthermore, we introduce a Sentence Conditioned Adapter (SCADA), which leverages sentence features to train a small portion of video backbone parameters adaptively. SCADA facilitates the…
