$C^3$: Compositional Counterfactual Contrastive Learning for Video-grounded Dialogues
Hung Le, Nancy F. Chen, Steven C.H. Hoi

TL;DR
This paper introduces $C^3$, a novel contrastive learning method for video-grounded dialogues that enhances multimodal reasoning and generalization by leveraging counterfactual samples and object/action-level variance.
Contribution
It proposes a new compositional counterfactual contrastive learning framework that improves video-grounded dialogue systems by focusing on hidden state representations and counterfactual sampling.
Findings
Achieved performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark
Improved grounding of video and dialogue context
Enhanced multimodal reasoning capabilities
Abstract
Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^3$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance.…
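To make the idea of contrastive training between factual and counterfactual samples concrete, here is a minimal sketch of a generic InfoNCE-style loss over hidden-state vectors. It is an illustration under stated assumptions, not the loss defined in the paper: the function names `cosine` and `contrastive_loss`, the use of cosine similarity, and the `temperature` value are all hypothetical choices, and the object-level and action-level terms mentioned in the abstract are not modeled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, factual, counterfactuals, temperature=0.1):
    """Generic InfoNCE-style loss (an assumption, not the paper's exact form).

    anchor:          hidden state computed from the original video and dialogue
    factual:         hidden state from a factual sample (treated as the positive)
    counterfactuals: hidden states from counterfactual samples (the negatives)
    """
    pos = math.exp(cosine(anchor, factual) / temperature)
    neg = sum(math.exp(cosine(anchor, c) / temperature) for c in counterfactuals)
    # The loss is small when the anchor is closer to the factual sample
    # than to any counterfactual sample.
    return -math.log(pos / (pos + neg))


# Toy 2-dimensional hidden states, chosen only for illustration.
anchor = [1.0, 0.0]
loss_aligned = contrastive_loss(anchor, [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = contrastive_loss(anchor, [0.0, 1.0], [[1.0, 0.0]])
```

In this toy setup `loss_aligned` is lower than `loss_misaligned`, which is the behavior a contrastive objective of this kind is meant to reward: representations of factual samples stay close to the original, while counterfactual ones are pushed away.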
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Topic Modeling
Methods: Contrastive Learning
