Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Changsheng Lv, Shuai Zhang, Yapeng Tian, Mengshi Qi, and Huadong Ma

TL;DR
This paper introduces a Disentangled Counterfactual Learning approach that enhances physical audiovisual commonsense reasoning by decoupling video features and modeling causal relationships, leading to state-of-the-art results.
Contribution
The paper proposes a novel DCL method that disentangles static and dynamic factors in videos and incorporates counterfactual reasoning to improve multimodal physical commonsense inference.
Findings
Achieves state-of-the-art performance on physical audiovisual reasoning tasks.
Enhances baseline models with a plug-and-play DCL module.
Demonstrates the effectiveness of causal reasoning in multimodal understanding.
Abstract
In this paper, we propose a Disentangled Counterfactual Learning (DCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physical commonsense from both video and audio input, and the main challenge is how to imitate human reasoning ability. Most current methods fail to take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in models impedes progress in inferring implicit physical knowledge. To address these issues, our proposed DCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space using a disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's…
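The static/dynamic split described in the abstract can be pictured with a deliberately simple toy. The sketch below is not the paper's disentangled sequential encoder (which is a learned VAE trained with a contrastive loss); it only assumes that a time-invariant factor can be approximated by the temporal mean of per-frame features and the time-varying factor by each frame's residual. The function name split_static_dynamic and the example feature values are hypothetical, chosen for illustration.

```python
from typing import List, Tuple

Frame = List[float]


def split_static_dynamic(frames: List[Frame]) -> Tuple[Frame, List[Frame]]:
    """Toy stand-in for disentanglement (an assumption, not the DCL encoder).

    static  : one vector per video, the mean of the frame features over time
    dynamic : one vector per frame, the frame feature minus the static vector
    """
    if not frames:
        raise ValueError("need at least one frame")
    n = len(frames)
    dim = len(frames[0])
    # Time-invariant part: average each feature dimension across all frames.
    static = [sum(f[d] for f in frames) / n for d in range(dim)]
    # Time-varying part: what remains in each frame after removing the mean.
    dynamic = [[f[d] - static[d] for d in range(dim)] for f in frames]
    return static, dynamic


# Hypothetical 2-D features for three frames: the first dimension changes
# over time, the second stays constant.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
static, dynamic = split_static_dynamic(frames)
```

In this toy, the constant second dimension ends up entirely in the static vector, while the changing first dimension shows up in the per-frame dynamic residuals. A learned encoder would instead infer both factors in a latent space.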
Taxonomy
Topics: Music and Audio Processing · Video Analysis and Summarization
