Representation-Level Counterfactual Calibration for Debiased Zero-Shot Recognition
Pei Peng, MingKun Xie, Hang Hao, Tong Jin, ShengJun Huang

TL;DR
This paper introduces a causal-inference method that improves zero-shot recognition by synthesizing counterfactual embeddings in CLIP's representation space, reducing object-context bias without retraining and reporting state-of-the-art zero-shot results.
Contribution
It presents a novel representation-level counterfactual calibration technique that enhances zero-shot model reliability by mitigating object-context shortcuts without retraining.
Findings
Significant accuracy improvements on context-sensitive benchmarks.
Effective reduction of hallucinated scores in zero-shot recognition.
Establishment of a new state-of-the-art in zero-shot performance.
Abstract
Object-context shortcuts remain a persistent challenge in vision-language models, undermining zero-shot reliability when test-time scenes differ from familiar training co-occurrences. We recast this issue as a causal inference problem and ask: Would the prediction remain if the object appeared in a different environment? To answer this at inference time, we estimate object and background expectations within CLIP's representation space, and synthesize counterfactual embeddings by recombining object features with diverse alternative contexts sampled from external datasets, batch neighbors, or text-derived descriptions. By estimating the Total Direct Effect and simulating intervention, we further subtract background-only activation, preserving beneficial object-context interactions while mitigating hallucinated scores. Without retraining or prompt design, our method substantially improves…
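The recombine-and-subtract idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes that the object and background components have already been estimated, that recombination is a simple vector sum followed by L2 normalization, and that the background-only activation is subtracted with a hypothetical weight `alpha`. All function and variable names are illustrative, and the 3-dimensional vectors are toy stand-ins for CLIP embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def factual_scores(obj_feat, bg_feat, text_feats):
    """Cosine scores of the original (object + background) embedding."""
    return l2_normalize(obj_feat + bg_feat) @ text_feats.T

def debiased_scores(obj_feat, bg_feat, alt_bg_feats, text_feats, alpha=1.0):
    """Average scores over counterfactual contexts, minus background-only scores.

    Assumed form, not taken from the paper:
      score = mean_k cos(obj + alt_bg_k, text) - alpha * cos(bg, text)
    """
    # Counterfactual embeddings: the same object placed in each alternative context.
    cf = l2_normalize(obj_feat[None, :] + alt_bg_feats)   # shape (k, d)
    cf_scores = (cf @ text_feats.T).mean(axis=0)          # shape (c,)
    # Activation produced by the original background alone.
    bg_only = l2_normalize(bg_feat) @ text_feats.T        # shape (c,)
    return cf_scores - alpha * bg_only

# Toy data: class 0 aligns with the object, class 1 with the original background.
text_feats = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
obj_feat = np.array([0.6, 0.0, 0.0])
bg_feat = np.array([0.0, 0.9, 0.0])
alt_bg_feats = np.array([[0.0, 0.0, 0.5],
                         [0.0, 0.1, 0.5]])

factual_pred = int(np.argmax(factual_scores(obj_feat, bg_feat, text_feats)))
debiased_pred = int(np.argmax(
    debiased_scores(obj_feat, bg_feat, alt_bg_feats, text_feats)))
```

In this toy setup the dominant background pulls the factual prediction toward class 1, while the calibrated score selects class 0, the class aligned with the object component. How the object and background components are estimated, and how alternative contexts are sampled, is left out of the sketch.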
