DMC$^3$: Dual-Modal Counterfactual Contrastive Construction for Egocentric Video Question Answering
Jiayi Zou, Chaofan Chen, Bing-Kun Bao, Changsheng Xu

TL;DR
This paper introduces DMC$^3$, a novel framework for egocentric video question answering that leverages counterfactual contrastive learning to better understand first-person videos and improve answer accuracy.
Contribution
It proposes a dual-modal counterfactual contrastive construction method that enhances egocentric VideoQA by generating and utilizing positive and negative samples for contrastive learning.
Findings
Achieves state-of-the-art results on the EgoTaskQA and QAEGO4D datasets.
Effectively models hand-object interactions and multiple events in egocentric videos.
Improves understanding of first-person perspectives in VideoQA tasks.
Abstract
Egocentric Video Question Answering (Egocentric VideoQA), the task of answering questions about first-person videos, plays an important role in egocentric video understanding. Although existing methods have made progress through the pre-training and fine-tuning paradigm, they ignore the unique challenges posed by the first-person perspective, such as understanding multiple events and recognizing hand-object interactions. To deal with these challenges, we propose a Dual-Modal Counterfactual Contrastive Construction (DMC$^3$) framework, which contains an egocentric VideoQA baseline, a counterfactual sample construction module, and a counterfactual sample-involved contrastive optimization. Specifically, we first develop a counterfactual sample construction module to generate positive and negative samples for textual and visual modalities through event description paraphrasing and…
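The abstract is cut off before the contrastive objective is described, so the exact loss used by the authors is not available here. As a rough illustration of how positive and negative samples can drive a contrastive optimization in general, the sketch below uses a standard InfoNCE-style loss over plain Python vectors. The function names (`cosine`, `info_nce`), the temperature value, and the toy feature vectors are all assumptions made for illustration; this is not the paper's formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed here, not taken from the paper).

    The loss is low when the anchor is closer to the positive sample
    than to every negative sample, and high otherwise.
    """
    # Index 0 holds the positive pair; the rest are negative pairs.
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]

    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(x - max_logit) for x in logits)
    )

    # Negative log-probability of picking the positive sample.
    return log_denominator - logits[0]


# Toy features: hypothetical stand-ins for an original sample (anchor),
# a paraphrase-like positive, and counterfactual negatives.
anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, positive, negatives)
loss_swapped = info_nce(anchor, negatives[0], [positive, negatives[1]])
```

In this toy setup, `loss_aligned` is smaller than `loss_swapped`, because the anchor is far more similar to the intended positive than to either negative. In a dual-modal setting, one such term could plausibly be computed per modality (textual and visual) and the terms combined, but how the paper combines them is not stated in the text above.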
