Diversified Augmentation with Domain Adaptation for Debiased Video Temporal Grounding
Junlong Ren, Gangjian Zhang, Haifeng Sun, Hao Wang

TL;DR
This paper introduces a training framework for temporal sentence grounding in videos that combines diversified data augmentation with domain adaptation to reduce temporal bias and improve generalization, achieving state-of-the-art results on Charades-CD and ActivityNet-CD.
Contribution
The paper proposes a new method combining diversified data augmentation with a domain discriminator to address temporal bias and improve model robustness in video grounding.
Findings
Achieves state-of-the-art performance on Charades-CD and ActivityNet-CD datasets.
Effectively reduces temporal bias and improves generalization across different grounding structures.
Demonstrates robustness to varying video lengths and target moment locations.
Abstract
Temporal sentence grounding in videos (TSGV) faces challenges due to public TSGV datasets containing significant temporal biases, which are attributed to the uneven temporal distributions of target moments. Existing methods generate augmented videos, where target moments are forced to have varying temporal locations. However, since the video lengths of the given datasets have small variations, only changing the temporal locations results in poor generalization ability in videos with varying lengths. In this paper, we propose a novel training framework complemented by diversified data augmentation and a domain discriminator. The data augmentation generates videos with various lengths and target moment locations to diversify temporal distributions. However, augmented videos inevitably exhibit distinct feature distributions which may introduce noise. To address this, we design a domain…
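As a rough illustration of the augmentation idea described in the abstract (videos with varied lengths and target moment locations), the sketch below pads a sequence of clip features with a random number of filler clips on each side and shifts the moment boundaries to match. This is not the authors' procedure: the function names `augment_sample` and `normalized_span`, the `max_pad` parameter, and the zero-vector filler are all assumptions chosen for illustration, and the summary does not say how the paper builds its augmented videos.

```python
import random


def augment_sample(features, start, end, max_pad=8, rng=None):
    """Pad a clip-feature sequence so both its length and the moment location vary.

    features: list of per-clip feature vectors (lists of floats)
    start, end: inclusive clip indices of the target moment
    max_pad: hypothetical upper bound on filler clips added to each side
    Returns (new_features, new_start, new_end).
    """
    rng = rng or random.Random()
    dim = len(features[0])
    n_before = rng.randint(0, max_pad)  # filler clips prepended
    n_after = rng.randint(0, max_pad)   # filler clips appended

    # Assumption: zero vectors stand in for filler content. A real pipeline
    # would more likely use clips taken from other videos.
    before = [[0.0] * dim for _ in range(n_before)]
    after = [[0.0] * dim for _ in range(n_after)]
    new_features = before + [list(f) for f in features] + after

    # Only the prepended clips shift the target moment.
    return new_features, start + n_before, end + n_before


def normalized_span(start, end, length):
    """Convert inclusive clip indices into a (start, end) span in [0, 1]."""
    return start / length, (end + 1) / length


if __name__ == "__main__":
    rng = random.Random(0)
    feats = [[float(i)] * 4 for i in range(10)]
    new_feats, s, e = augment_sample(feats, 3, 5, rng=rng)
    print(len(new_feats), normalized_span(s, e, len(new_feats)))
```

Because the padding on each side is drawn independently, repeated calls on the same sample yield different total lengths and different normalized moment positions, which is the kind of diversity in temporal distribution the abstract refers to. Filler content of this sort would also have a different feature distribution from the original clips, which is the noise concern the abstract raises.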
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Advanced Vision and Imaging
