Towards Debiasing Temporal Sentence Grounding in Video

Hao Zhang; Aixin Sun; Wei Jing; Joey Tianyi Zhou

arXiv:2111.04321·cs.CV·November 9, 2021·6 cites

Open Access

TL;DR

This paper introduces data and model debiasing strategies for temporal sentence grounding in videos, improving model generalization by reducing bias and enhancing cross-modal reasoning.

Contribution

It proposes novel debiasing techniques, including data oversampling and leveraging bias models, to enhance cross-modal understanding in TSGV models.

Findings

1. Both strategies improve generalization on out-of-distribution data.
2. Combined strategies achieve state-of-the-art results.
3. Debiasing enhances cross-modal reasoning capabilities.

Abstract

The temporal sentence grounding in video (TSGV) task is to locate a temporal moment in an untrimmed video that matches a language query, i.e., a sentence. Without accounting for bias in moment annotations (e.g., start and end positions in a video), many models tend to capture statistical regularities of the moment annotations and fail to learn cross-modal reasoning between the video and the language query well. In this paper, we propose two debiasing strategies, data debiasing and model debiasing, to "force" a TSGV model to capture cross-modal interactions. Data debiasing performs data oversampling through video truncation to balance the temporal distribution of moments in the training set. Model debiasing leverages video-only and query-only models to capture the distribution bias, and forces the model to learn cross-modal interactions. Using VSLNet as the base model, we evaluate the impact of the two strategies on…
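The data debiasing idea in the abstract, oversampling by truncating videos so that annotated moments land at different relative positions, can be illustrated with a minimal sketch. The abstract does not describe the paper's actual truncation policy or how the distribution is balanced, so everything below is an assumption made for illustration: the `Sample` tuple layout, the hypothetical helpers `truncate_sample` and `oversample`, and the choice of uniformly random cut points.

```python
import random
from typing import List, Tuple

# Hypothetical sample layout (an assumption, not the paper's data format):
# (video_duration, moment_start, moment_end), all in seconds.
Sample = Tuple[float, float, float]


def truncate_sample(sample: Sample, rng: random.Random) -> Sample:
    """Cut a random prefix and suffix off the video without touching the moment.

    Assumption: cut points are drawn uniformly; the paper may use another policy.
    """
    duration, start, end = sample
    clip_start = rng.uniform(0.0, start)    # new video begins at or before the moment
    clip_end = rng.uniform(end, duration)   # new video ends at or after the moment
    # Re-express the annotation relative to the truncated video.
    return (clip_end - clip_start, start - clip_start, end - clip_start)


def oversample(samples: List[Sample], copies: int, seed: int = 0) -> List[Sample]:
    """Return the original samples plus `copies` truncated variants of each."""
    rng = random.Random(seed)
    augmented = list(samples)
    for sample in samples:
        for _ in range(copies):
            augmented.append(truncate_sample(sample, rng))
    return augmented


if __name__ == "__main__":
    # A moment near the end of a 30-second video.
    data = [(30.0, 20.0, 28.0)]
    for duration, start, end in oversample(data, copies=3):
        print(f"duration={duration:.1f}s  normalized start={start / duration:.2f}")
```

Each truncated copy keeps the moment's length but changes its normalized position in the video, which is what lets oversampling shift the temporal distribution of moments that a model sees during training.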

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization