Causal Debiasing for Visual Commonsense Reasoning

Jiayi Zou; Gengyun Jia; Bing-Kun Bao

arXiv:2510.20281·cs.CV·October 24, 2025

TL;DR

This paper identifies biases in Visual Commonsense Reasoning datasets and proposes a causal debiasing method using backdoor adjustment and a dictionary-based approach to improve model generalization.

Contribution

The paper introduces the VCR-OOD datasets (VCR-OOD-QA and VCR-OOD-VA) for evaluating model generalization across the textual and visual modalities, and applies backdoor adjustment, a causal inference technique, to reduce dataset biases.

Findings

01 Debiasing improves model generalization across datasets

02 VCR-OOD datasets reveal existing biases in VCR models

03 Causal methods outperform baseline debiasing approaches

Abstract

Visual Commonsense Reasoning (VCR) refers to answering questions and providing explanations based on images. While existing methods achieve high prediction accuracy, they often overlook bias in datasets and lack debiasing strategies. In this paper, our analysis reveals co-occurrence and statistical biases in both textual and visual data. We introduce the VCR-OOD datasets, comprising VCR-OOD-QA and VCR-OOD-VA subsets, which are designed to evaluate the generalization capabilities of models across two modalities. Furthermore, we analyze the causal graphs and prediction shortcuts in VCR and adopt a backdoor adjustment method to remove bias. Specifically, we create a dictionary based on the set of correct answers to eliminate prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
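
To make the backdoor adjustment mentioned in the abstract concrete, here is a minimal sketch of the general formula P(y | do(x)) = sum over z of P(y | x, z) P(z) for a discrete confounder. It is not the authors' implementation: the function name `backdoor_adjust`, the strata `z0` and `z1`, and all probabilities are hypothetical toy values. According to the abstract, the paper builds a dictionary from the set of correct answers for this purpose; how that dictionary is constructed is not described here, so the sketch simply assumes the confounder strata and their priors are given.

```python
from collections import defaultdict


def backdoor_adjust(p_y_given_xz, p_z):
    """Return P(y | do(x)) = sum_z P(y | x, z) * P(z) for a fixed input x.

    p_y_given_xz: maps each confounder value z to a dict {answer: P(y | x, z)}.
    p_z:          maps each confounder value z to its prior P(z).
    Both inputs are hypothetical stand-ins for quantities a model would estimate.
    """
    adjusted = defaultdict(float)
    for z, prior in p_z.items():
        for answer, prob in p_y_given_xz[z].items():
            # Weight each stratum by P(z), not by P(z | x), which is what
            # removes the confounder's influence on the choice of x.
            adjusted[answer] += prob * prior
    return dict(adjusted)


# Toy numbers (assumed for illustration only): two confounder strata.
p_z = {"z0": 0.5, "z1": 0.5}
p_y_given_xz = {
    "z0": {"A": 0.9, "B": 0.1},
    "z1": {"A": 0.2, "B": 0.8},
}

result = backdoor_adjust(p_y_given_xz, p_z)
print(result)
```

With these toy values the adjusted score for answer A is 0.9 * 0.5 + 0.2 * 0.5, so the prediction reflects every stratum in proportion to its prior rather than whichever stratum happens to co-occur with the input in the training data.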
