Weakly-Supervised Video Object Grounding via Causal Intervention

Wei Wang; Junyu Gao; Changsheng Xu

arXiv:2112.00475 · cs.CV · December 2, 2021 · 1 cites

Open Access

TL;DR

This paper introduces a causal intervention framework for weakly-supervised video object grounding (WSVOG). It targets the spurious associations between sentence-described objects and video regions that arise from weak supervision and observational bias.

Contribution

The paper proposes a unified causal framework that uses spatial-temporal adversarial contrastive learning and backdoor adjustment to deconfound object-relevant associations under weak supervision.
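Backdoor adjustment, mentioned above, can be illustrated with a toy discrete example. The sketch below is not the paper's model: the variables X (candidate match), Y (outcome), and Z (confounder), the probability tables, and the function names are all hypothetical, chosen only to show how P(Y | do(X)) = Σ_z P(Y | X, z) P(z) differs from the observational P(Y | X) when Z influences both X and Y.

```python
# Toy illustration of backdoor adjustment (hypothetical numbers, not from the paper).
# Z is a binary confounder; Y depends only on Z, yet X and Y look correlated
# observationally because Z also influences X.

def observational(p_y_given_xz, p_z_given_x, x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z | X=x)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z_given_x[x].items())

def backdoor_adjust(p_y_given_xz, p_z, x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | X=x, Z=z) * P(Z=z)."""
    return sum(p_y_given_xz[(x, z)] * pz for z, pz in p_z.items())

# Marginal prior over the confounder (consistent with P(X=0) = P(X=1) = 0.5).
p_z = {0: 0.5, 1: 0.5}
# The confounder is unevenly distributed across values of X.
p_z_given_x = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.2, 1: 0.8}}
# Y depends on Z only; X has no causal effect on Y.
p_y_given_xz = {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.5}

obs_gap = observational(p_y_given_xz, p_z_given_x, 1) - observational(p_y_given_xz, p_z_given_x, 0)
do_gap = backdoor_adjust(p_y_given_xz, p_z, 1) - backdoor_adjust(p_y_given_xz, p_z, 0)
print(f"observational gap: {obs_gap:.2f}, interventional gap: {do_gap:.2f}")
```

In this toy setup the observational gap is non-zero even though X has no causal effect on Y, and the adjusted gap vanishes. How the paper instantiates the confounder and the adjustment for video-sentence matching is not stated in the text available here.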

Findings

01 Achieves state-of-the-art performance on multiple benchmarks.

02 Demonstrates robustness against distribution shifts and out-of-distribution data.

03 Outperforms existing methods in accuracy and reliability.

Abstract

We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning. The task aims to localize objects described in the sentence to visual regions in the video, a fundamental capability needed in pattern analysis and machine learning. Despite recent progress, existing methods all suffer from the severe problem of spurious association, which harms grounding performance. In this paper, we start from the definition of WSVOG and pinpoint the spurious association from two aspects: (1) the association itself is not object-relevant but extremely ambiguous due to weak supervision, and (2) the association is unavoidably confounded by the observational bias when taking the statistics-based matching strategy in existing methods. With this in mind, we design a unified causal framework to learn the…

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Learning