Counterfactual Cross-modality Reasoning for Weakly Supervised Video Moment Localization
Zezhong Lv, Bing Su, Ji-Rong Wen

TL;DR
This paper introduces a counterfactual cross-modality reasoning approach to improve weakly supervised video moment localization by reducing spurious correlations and enhancing vision-language alignment.
Contribution
It proposes a novel counterfactual reasoning method that explicitly models and suppresses spurious effects in cross-modality reconstruction for better localization.
Findings
Significant improvement over existing weakly supervised methods
Effective mitigation of spurious correlations in cross-modality learning
Enhanced accuracy in video moment localization
Abstract
Video moment localization aims to retrieve the target segment of an untrimmed video according to a natural language query. Weakly supervised methods have gained attention recently, as the precise temporal location of the target segment is not always available. However, one of the greatest challenges for weakly supervised methods is the mismatch between video and language induced by the coarse temporal annotations. To refine the vision-language alignment, recent works contrast the cross-modality similarities, driven by reconstructing masked queries, between positive and negative video proposals. However, the reconstruction may be influenced by the latent spurious correlation between the unmasked and the masked parts, which distorts the restoring process and further degrades the efficacy of contrastive learning since the masked words are not completely reconstructed…
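The reconstruction-driven contrast that the abstract attributes to recent works can be illustrated with a small sketch. This is not the paper's method or code: the function names, the negative log-likelihood reconstruction loss, the hinge form, and the `margin` value are all assumptions chosen for illustration. The idea shown is only that a positive proposal should reconstruct the masked query words better than a negative proposal.

```python
import math


def reconstruction_loss(pred_probs, target_ids):
    """Mean negative log-likelihood of the masked words.

    pred_probs: one probability distribution over the vocabulary per masked
                word, as predicted from a video proposal (hypothetical input).
    target_ids: vocabulary index of each masked word.
    """
    nll = [-math.log(probs[t]) for probs, t in zip(pred_probs, target_ids)]
    return sum(nll) / len(nll)


def contrastive_margin_loss(loss_pos, loss_neg, margin=0.1):
    """Hinge loss (assumed form): penalize the model unless the positive
    proposal reconstructs the masked words better than the negative
    proposal by at least `margin`."""
    return max(0.0, loss_pos - loss_neg + margin)


# Toy example: a 3-word vocabulary and two masked words.
targets = [0, 1]
# Distributions predicted from a proposal that matches the query.
pos_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
# Distributions predicted from a proposal that does not match the query.
neg_probs = [[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]]

loss_pos = reconstruction_loss(pos_probs, targets)
loss_neg = reconstruction_loss(neg_probs, targets)
contrast = contrastive_margin_loss(loss_pos, loss_neg)
```

In this toy setting the positive proposal already wins by more than the margin, so the hinge term is zero. If the unmasked words alone made the masked words easy to guess, both losses would shrink regardless of the video, which is the kind of spurious correlation the abstract describes as weakening the contrast.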
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Learning
