Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

Mengshi Qi; Changsheng Lv; Huadong Ma

arXiv:2502.12425·cs.CV·February 19, 2025

TL;DR

This paper introduces RDCL, a novel approach for physical audiovisual commonsense reasoning that disentangles video features, incorporates counterfactual reasoning, and handles missing modalities to improve robustness and accuracy.

Contribution

The paper presents a disentangled variational autoencoder with counterfactual learning and robust multimodal feature recovery, advancing physical commonsense reasoning under incomplete data scenarios.

Findings

01 Achieves state-of-the-art reasoning accuracy

02 Enhances robustness against missing modality data

03 Improves baseline models with plug-and-play modules

Abstract

In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual commonsense reasoning. The task aims to infer physical commonsense about objects from both video and audio input; the main challenge is imitating human reasoning ability, even when a modality is missing. Most current methods fail to take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in these models impedes progress in inferring implicit physical knowledge. To address these issues, our proposed RDCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space using a disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we…
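The static/dynamic split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear encoders, the mean pooling over time, the names `encode_static_dynamic` and `info_nce`, and the choice of InfoNCE as the contrastive objective are all assumptions made for illustration (the abstract only says "a contrastive loss function").

```python
import numpy as np

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps (the standard VAE reparameterization)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def encode_static_dynamic(frames, w_static, w_dynamic, rng):
    """Split per-frame features (T, D) into one static and T dynamic latents.

    Hypothetical linear encoders: each weight matrix maps D -> 2 * latent_dim
    (mean and log-variance concatenated). A real model would use a learned
    sequential network here.
    """
    # Static factor: pool over time first, so the latent cannot vary per frame.
    pooled = frames.mean(axis=0)
    s_mu, s_logvar = np.split(pooled @ w_static, 2)
    z_static = reparameterize(s_mu, s_logvar, rng)

    # Dynamic factors: one latent per frame.
    d_mu, d_logvar = np.split(frames @ w_dynamic, 2, axis=-1)
    z_dynamic = reparameterize(d_mu, d_logvar, rng)
    return z_static, z_dynamic

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss, assumed here as one common contrastive objective
    that lower-bounds mutual information."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
    logits = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    logits = logits / temperature
    logits = logits - logits.max()  # numerical stability
    return -(logits[0] - np.log(np.exp(logits).sum()))

# Toy usage with random features standing in for video frame embeddings.
rng = np.random.default_rng(0)
T, D, L = 8, 16, 4  # frames, feature dim, latent dim (illustrative sizes)
frames = rng.standard_normal((T, D))
other_clip = rng.standard_normal((T, D))
w_static = rng.standard_normal((D, 2 * L)) * 0.1
w_dynamic = rng.standard_normal((D, 2 * L)) * 0.1

z_static, z_dynamic = encode_static_dynamic(frames, w_static, w_dynamic, rng)
z_static_again, _ = encode_static_dynamic(frames, w_static, w_dynamic, rng)
z_static_other, _ = encode_static_dynamic(other_clip, w_static, w_dynamic, rng)

# Two samples of the same clip's static latent form the positive pair;
# a different clip's static latent serves as the negative.
loss = info_nce(z_static, z_static_again, [z_static_other])
print(z_static.shape, z_dynamic.shape)  # (4,) (8, 4)
```

Pooling over time before encoding is one simple way to force the static latent to be time-invariant; the per-frame latents are then free to carry whatever changes across the sequence.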

Taxonomy

Topics: Digital Media Forensic Detection