CF-VLM: CounterFactual Vision-Language Fine-tuning

Jusheng Zhang; Kaitong Cai; Yijia Fan; Jian Wang; Keze Wang

arXiv:2506.17267 · cs.LG · June 24, 2025

TL;DR

CF-VLM introduces a counterfactual training framework that significantly enhances the causal reasoning and discriminative capabilities of vision-language models, leading to better generalization and factual consistency.

Contribution

The paper presents a counterfactual fine-tuning method that improves causal reasoning in vision-language models by combining targeted counterfactual samples with three complementary training objectives.

Findings

01. Outperforms state-of-the-art on reasoning benchmarks
02. Reduces visual hallucinations and improves factual consistency
03. Enhances model robustness and interpretability

Abstract

Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model's sensitivity to minimal but critical causal edits. Extensive experiments demonstrate…
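
One way to picture the three objectives named in the abstract is as a weighted sum of loss terms. The sketch below is purely illustrative and is not the paper's formulation: the hinge form, the margins, the weights, the temperature, and every function name are assumptions made for this example, since the abstract excerpt does not give the actual losses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_loss(images, texts, temperature=0.07):
    """Image-to-text InfoNCE: each image should score highest with its own caption.

    Assumption: a CLIP-style contrastive loss stands in for
    "foundational cross-modal alignment".
    """
    total = 0.0
    for i, img in enumerate(images):
        logits = [cosine(img, txt) / temperature for txt in texts]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_denominator - logits[i]
    return total / len(images)


def margin_loss(text, factual_image, counterfactual_image, margin):
    """Hinge penalty unless the factual image beats the counterfactual by `margin`.

    Assumption: a simple margin on cosine similarity; hypothetical helper.
    """
    gap = cosine(factual_image, text) - cosine(counterfactual_image, text)
    return max(0.0, margin - gap)


def combined_loss(images, texts, coherent_cf, minimal_cf,
                  weights=(1.0, 0.5, 0.5), margins=(0.2, 0.1)):
    """Weighted sum of three illustrative terms (weights and margins are made up).

    coherent_cf: embeddings of full counterfactual scenes, one per factual image.
    minimal_cf:  embeddings of minimally edited scenes, one per factual image.
    """
    n = len(images)
    align = alignment_loss(images, texts)
    stability = sum(
        margin_loss(t, f, c, margins[0])
        for t, f, c in zip(texts, images, coherent_cf)
    ) / n
    sensitivity = sum(
        margin_loss(t, f, c, margins[1])
        for t, f, c in zip(texts, images, minimal_cf)
    ) / n
    return weights[0] * align + weights[1] * stability + weights[2] * sensitivity


# Toy 2-D embeddings: matched pairs, clearly different counterfactuals,
# and counterfactuals that differ only slightly from the factual image.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
coherent_cf = [[0.0, 1.0], [1.0, 0.0]]
minimal_cf = [[1.0, 0.1], [0.1, 1.0]]
total = combined_loss(images, texts, coherent_cf, minimal_cf)
```

In this toy setup the clearly different counterfactuals already clear their margin, so only the minimally edited ones contribute a penalty. That mirrors the abstract's point that small but critical edits are the harder case.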
