Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning

Changsheng Lv, Shuai Zhang, Yapeng Tian, Mengshi Qi, and Huadong Ma

arXiv:2310.19559 · cs.CV · November 3, 2023 · 2 cites

TL;DR

This paper introduces a Disentangled Counterfactual Learning approach that enhances physical audiovisual commonsense reasoning by decoupling video features and modeling causal relationships, leading to state-of-the-art results.

Contribution

The paper proposes a novel DCL method that disentangles static and dynamic factors in videos and incorporates counterfactual reasoning to improve multimodal physical commonsense inference.

Findings

01

Achieves state-of-the-art performance on physical audiovisual reasoning tasks.

02

Enhances baseline models with a plug-and-play DCL module.

03

Demonstrates the effectiveness of causal reasoning in multimodal understanding.

Abstract

In this paper, we propose a Disentangled Counterfactual Learning (DCL) approach for physical audiovisual commonsense reasoning. The task aims to infer objects' physical commonsense from both video and audio input, and the main challenge is how to imitate the reasoning ability of humans. Most current methods fail to take full advantage of the different characteristics of multi-modal data, and the lack of causal reasoning ability in models impedes progress in inferring implicit physical knowledge. To address these issues, our proposed DCL method decouples videos into static (time-invariant) and dynamic (time-varying) factors in the latent space using a disentangled sequential encoder, which adopts a variational autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual learning module to augment the model's…
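
The static/dynamic split mentioned in the abstract can be illustrated with a small sketch. This is not the paper's encoder: it assumes a simple stand-in in which the static factor is the time average of per-frame features and the dynamic factors are the per-frame residuals, and it uses InfoNCE as a generic example of a contrastive loss, which may differ from the formulation in the paper. The function names `split_static_dynamic`, `cosine`, and `info_nce` are hypothetical and chosen only for this illustration.

```python
import math


def split_static_dynamic(frames):
    """Split per-frame feature vectors into a static part and dynamic parts.

    Assumption (not from the paper): static = mean over time,
    dynamic = each frame's residual from that mean.
    """
    num_frames = len(frames)
    dim = len(frames[0])
    static = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    dynamic = [[f[d] - static[d] for d in range(dim)] for f in frames]
    return static, dynamic


def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: low when the anchor is closer to the positive
    than to any of the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy clip: the second feature is constant over time, the first one varies.
frames = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]
static, dynamic = split_static_dynamic(frames)

# A matching positive yields a much smaller loss than a mismatched one.
loss_match = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_mismatch = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

In this toy clip the constant second feature ends up entirely in `static`, while the time-varying first feature is carried by `dynamic`. A learned encoder would replace the mean-and-residual rule, and a contrastive term like the one above would encourage each factor to stay informative about its own clip.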

Code & Models

Videos

Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning · SlidesLive

Taxonomy

Topics: Music and Audio Processing · Video Analysis and Summarization