Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
Mengshi Qi, Changsheng Lv, Huadong Ma

TL;DR
This paper introduces RDCL, a novel approach for physical audiovisual commonsense reasoning that disentangles video features, incorporates counterfactual reasoning, and handles missing modalities to improve robustness and accuracy.
Contribution
The paper presents a disentangled variational autoencoder with counterfactual learning and robust multimodal feature recovery, advancing physical commonsense reasoning under incomplete data scenarios.
Findings
Achieves state-of-the-art reasoning accuracy
Enhances robustness against missing modality data
Improves baseline models with plug-and-play modules
Abstract
In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task is to infer the physical commonsense properties of objects from both video and audio input; the main challenge is imitating human reasoning ability, even when a modality is missing. Most current methods fail to take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in these models impedes progress in inferring implicit physical knowledge. To address these issues, RDCL decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space with a disentangled sequential encoder, which adopts a variational autoencoder (VAE) and maximizes mutual information through a contrastive loss function. Furthermore, we…
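The static/dynamic split and the contrastive objective described in the abstract can be illustrated with a small toy sketch. This is not the paper's encoder: it assumes the time-invariant factor can be approximated by the temporal mean of per-frame features and the time-varying factor by the per-frame residual, where RDCL learns both with a VAE-based sequential encoder. The abstract does not say which contrastive loss is used, so InfoNCE (a common lower bound on mutual information) is assumed here. All function names, the toy data, and the temperature value are hypothetical.

```python
import math
import random


def split_static_dynamic(frames):
    """Toy stand-in for a disentangled encoder (assumption, not the paper's VAE):
    static factor = temporal mean, dynamic factors = per-frame residuals."""
    num_frames = len(frames)
    dim = len(frames[0])
    static = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    dynamic = [[f[d] - static[d] for d in range(dim)] for f in frames]
    return static, dynamic


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: anchors[i] should match positives[i], not positives[j != i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates.
        max_logit = max(logits)
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(anchors)


def toy_video(appearance, num_frames=8, noise=0.1):
    """Hypothetical clip: a fixed appearance vector plus per-frame Gaussian jitter."""
    return [[a + random.gauss(0.0, noise) for a in appearance]
            for _ in range(num_frames)]


random.seed(0)
appearances = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]

# Two independently sampled clips per object; their static factors form positive pairs.
view_a = [split_static_dynamic(toy_video(a))[0] for a in appearances]
view_b = [split_static_dynamic(toy_video(a))[0] for a in appearances]

aligned = info_nce(view_a, view_b)                      # correct pairing
shuffled = info_nce(view_a, view_b[1:] + view_b[:1])    # deliberately mismatched pairing
print(f"aligned loss: {aligned:.4f}, shuffled loss: {shuffled:.4f}")
```

In this toy setting the loss is lower when static factors from two clips of the same object are paired than when the pairing is shuffled, which is the behavior a contrastive objective rewards. A learned encoder would replace `split_static_dynamic` with sampled latent variables.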
Taxonomy
Topics: Digital Media Forensic Detection
