$C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues

Hung Le; Nancy F. Chen; Steven C.H. Hoi

arXiv:2106.08914 · cs.LG · August 8, 2023

TL;DR

This paper introduces $C^3$, a novel contrastive learning method for video-grounded dialogues that enhances multimodal reasoning and generalization by leveraging counterfactual samples and object/action-level variance.

Contribution

It proposes a new compositional counterfactual contrastive learning framework that improves video-grounded dialogue systems by focusing on hidden state representations and counterfactual sampling.

Findings

01. Achieved performance gains on the AVSD benchmark

02. Improved grounding of video and dialogue context

03. Enhanced multimodal reasoning capabilities

Abstract

Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^{3}$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance.…
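To make the idea of contrasting factual and counterfactual samples concrete, here is a minimal sketch in plain Python. It is not the paper's loss or sampling procedure, which the truncated abstract does not spell out. It uses a generic InfoNCE-style objective as a stand-in, and every name in it (`contrastive_loss`, `split_by_temporal_steps`, `temperature`, `filler`) is a hypothetical chosen for illustration. The sampling helper assumes that a factual sample keeps the temporal steps judged relevant and a counterfactual sample replaces them, which is one plausible reading of "sampling based on the temporal steps in videos".

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, factual, counterfactuals, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's exact objective).

    The loss is small when the anchor representation is close to the factual
    sample and far from every counterfactual sample.
    """
    pos = math.exp(cosine(anchor, factual) / temperature)
    neg = sum(math.exp(cosine(anchor, c) / temperature) for c in counterfactuals)
    return -math.log(pos / (pos + neg))


def split_by_temporal_steps(frames, relevant_steps, filler):
    """Build one factual and one counterfactual sequence (illustrative assumption).

    factual:        keeps the relevant steps and replaces the others with `filler`
    counterfactual: replaces the relevant steps with `filler` and keeps the others
    """
    relevant = set(relevant_steps)
    factual = [f if i in relevant else filler for i, f in enumerate(frames)]
    counterfactual = [filler if i in relevant else f for i, f in enumerate(frames)]
    return factual, counterfactual


if __name__ == "__main__":
    # Toy 2-d "hidden states": the factual sample matches the anchor.
    print(contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]]))
    print(split_by_temporal_steps(["a", "b", "c", "d"], [1, 2], "_"))
```

In a real system the vectors would be hidden states produced by the dialogue model and the loss would be computed on batched tensors; the list-based version above only shows the shape of the computation.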

Taxonomy

Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling

Methods: Contrastive Learning