A Closer Look at Temporal Sentence Grounding in Videos: Dataset and Metric
Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, Wenwu Zhu

TL;DR
This paper critically examines the evaluation protocols for Temporal Sentence Grounding in Videos, revealing biases and proposing new dataset splits and metrics to improve benchmarking reliability and progress measurement.
Contribution
The paper reorganizes two widely used TSGV benchmarks (ActivityNet Captions and Charades-STA) into out-of-distribution splits and proposes a new evaluation metric, so that TSGV methods are assessed with less influence from moment annotation biases.
Findings
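Current TSGV models tend to capture moment annotation biases and fail to take full advantage of multi-modal inputs.
Several extremely simple baselines that require no training can match state-of-the-art performance under existing benchmarks.
Both the prevailing dataset splits and the prevailing evaluation metrics contribute to unreliable benchmarking, which motivates the reorganized splits and the new metric.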
Abstract
Temporal Sentence Grounding in Videos (TSGV), i.e., grounding a natural language sentence that describes complex human activities in a long, untrimmed video sequence, has received unprecedented attention over the last few years. Although each newly proposed method can plausibly achieve better performance than previous ones, current TSGV models still tend to capture the moment annotation biases and fail to take full advantage of multi-modal inputs. Even more strikingly, several extremely simple baselines without training can also achieve state-of-the-art performance. In this paper, we take a closer look at the existing evaluation protocols for TSGV, and find that both the prevailing dataset splits and evaluation metrics are to blame for unreliable benchmarking. To this end, we propose to re-organize two widely-used TSGV benchmarks (ActivityNet Captions and Charades-STA).…
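For readers unfamiliar with how TSGV predictions are usually scored, the sketch below shows temporal intersection-over-union (IoU) and recall at an IoU threshold, and why a training-free baseline can look strong under such a metric. It is an illustration only and is not taken from the paper: the function names, the fixed "predict the whole video" baseline, and the toy ground-truth segments are all hypothetical assumptions.

```python
# Illustrative sketch, not the paper's code. All names and data are hypothetical.

def temporal_iou(pred, gt):
    """IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds, gts, threshold):
    """Fraction of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)


# Toy ground-truth moments in three 30-second videos (assumed values).
# Two of them cover most of the video, mimicking an annotation bias
# toward long moments.
ground_truth = [(0.0, 25.0), (2.0, 30.0), (10.0, 20.0)]

# Training-free baseline: ignore the query and predict the full video.
baseline_preds = [(0.0, 30.0)] * len(ground_truth)

score = recall_at_iou(baseline_preds, ground_truth, threshold=0.5)
print(f"R@1, IoU>=0.5 for the whole-video baseline: {score:.2f}")
```

On this toy data the baseline is counted as correct on the two long moments without looking at the query at all, which is the kind of effect the abstract attributes to biased splits and metrics.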
