Towards Debiasing Temporal Sentence Grounding in Video
Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou

TL;DR
This paper introduces data debiasing and model debiasing strategies for temporal sentence grounding in video (TSGV). Both aim to improve generalization by reducing a model's reliance on biased moment annotations and pushing it toward cross-modal reasoning.
Contribution
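It proposes two debiasing techniques for TSGV models: data oversampling through video truncation, which balances the temporal distribution of moments in the training set, and video-only and query-only bias models, which capture the distribution bias so that the main model is forced to learn cross-modal interactions.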
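The text here does not say how the outputs of the bias models are combined with the main model. As a rough illustration only, the sketch below uses a product-of-experts ensemble, a common scheme in the debiasing literature that is swapped in here as an assumption and is not claimed to be the authors' formulation. The names `log_softmax`, `poe_combine`, and `nll` are hypothetical, and the logits stand for scores over candidate start positions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of scores."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def poe_combine(main_logits, bias_logits_list):
    """Product of experts (assumed scheme, not taken from the paper):
    add the log-probabilities of the main model and of each bias-only
    model (e.g. video-only, query-only), then renormalize."""
    combined = log_softmax(main_logits)
    for bias_logits in bias_logits_list:
        if len(bias_logits) != len(main_logits):
            raise ValueError("all models must score the same positions")
        bias_log_probs = log_softmax(bias_logits)
        combined = [c + b for c, b in zip(combined, bias_log_probs)]
    return log_softmax(combined)


def nll(log_probs, target):
    """Negative log-likelihood of the annotated position."""
    return -log_probs[target]


# Toy case: the main model is undecided over 4 positions, while a
# video-only bias model already favors position 0.
main = [0.0, 0.0, 0.0, 0.0]
video_only = [2.0, 0.0, 0.0, 0.0]
query_only = [0.5, 0.0, 0.0, 0.0]

loss_main_alone = nll(log_softmax(main), target=0)
loss_ensemble = nll(poe_combine(main, [video_only, query_only]), target=0)
print(loss_main_alone, loss_ensemble)
```

In this kind of scheme the training loss is computed on the combined distribution while the bias models stay fixed, so examples that the bias models already explain contribute less training signal to the main model; at inference only the main model's scores would be used.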
Findings
Both strategies improve generalization on out-of-distribution data.
Combined strategies achieve state-of-the-art results.
Debiasing enhances cross-modal reasoning capabilities.
Abstract
The temporal sentence grounding in video (TSGV) task is to locate a temporal moment in an untrimmed video that matches a language query, i.e., a sentence. Without considering bias in moment annotations (e.g., start and end positions in a video), many models tend to capture statistical regularities of the moment annotations and fail to learn cross-modal reasoning between video and language query well. In this paper, we propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions. Data debiasing performs data oversampling through video truncation to balance the temporal distribution of moments in the training set. Model debiasing leverages video-only and query-only models to capture the distribution bias, and forces the model to learn cross-modal interactions. Using VSLNet as the base model, we evaluate the impact of the two strategies on…
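To make the data debiasing idea concrete, here is a minimal sketch of oversampling through video truncation. It assumes each training sample is a video duration plus one annotated moment, and that truncation removes footage only outside the moment. The `Sample` type, the function names, and the uniform choice of cut lengths are all hypothetical; the abstract does not specify how cuts are chosen to balance the distribution, so that part is not implemented here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    duration: float  # video length in seconds
    start: float     # annotated moment start, in seconds
    end: float       # annotated moment end, in seconds


def truncate(sample, cut_head, cut_tail):
    """Drop cut_head seconds from the front and cut_tail seconds from
    the back of the video, keeping the annotated moment intact."""
    if cut_head < 0 or cut_tail < 0:
        raise ValueError("cut lengths must be non-negative")
    if cut_head > sample.start or cut_tail > sample.duration - sample.end:
        raise ValueError("truncation would cut into the annotated moment")
    return Sample(
        duration=sample.duration - cut_head - cut_tail,
        start=sample.start - cut_head,  # timestamps shift with the new origin
        end=sample.end - cut_head,
    )


def random_truncation(sample, rng):
    """Pick cut lengths uniformly from the footage outside the moment
    (an assumed policy, chosen only for illustration)."""
    cut_head = rng.uniform(0.0, sample.start)
    cut_tail = rng.uniform(0.0, sample.duration - sample.end)
    return truncate(sample, cut_head, cut_tail)


def oversample(samples, copies, seed=0):
    """Return the original samples plus `copies` truncated variants of each."""
    rng = random.Random(seed)
    out = list(samples)
    for s in samples:
        out.extend(random_truncation(s, rng) for _ in range(copies))
    return out


# A moment at 10-20 s in a 30 s video starts at 1/3 of the video.
# Cutting 5 s from each side moves it to 5-15 s in a 20 s video (1/4).
original = Sample(duration=30.0, start=10.0, end=20.0)
shorter = truncate(original, cut_head=5.0, cut_tail=5.0)
print(shorter.start / shorter.duration)
```

Because truncation changes the video length but not the moment length, the normalized start and end positions of the same moment shift, which is what lets oversampled copies fill under-represented regions of the position distribution.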
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
