CF-VLM: CounterFactual Vision-Language Fine-tuning
Jusheng Zhang, Kaitong Cai, Yijia Fan, Jian Wang, Keze Wang

TL;DR
CF-VLM introduces a counterfactual training framework that significantly enhances the causal reasoning and discriminative capabilities of vision-language models, leading to better generalization and factual consistency.
Contribution
The paper presents a novel counterfactual fine-tuning method that improves causal reasoning in vision-language models by using targeted counterfactual samples and multiple training objectives.
Findings
Outperforms state-of-the-art baselines on reasoning benchmarks
Reduces visual hallucinations and improves factual consistency
Enhances model robustness and interpretability
Abstract
Recent advances in vision-language models (VLMs) have greatly improved cross-modal semantic understanding, yet significant limitations remain in fine-grained discrimination and deep causal reasoning tasks. Existing VLMs often rely on superficial statistical correlations, lacking the ability to capture the underlying causal logic between visual and textual content. To address this, we propose CounterFactual Vision-Language Fine-tuning (CF-VLM), a novel framework that enhances the causal reasoning capabilities of VLMs through the targeted use of counterfactual samples. CF-VLM introduces three complementary training objectives: maintaining foundational cross-modal alignment, reinforcing the uniqueness and stability of factual scene representations against coherent counterfactuals, and sharpening the model's sensitivity to minimal but critical causal edits. Extensive experiments demonstrate…
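The three objectives named in the abstract can be illustrated with a small, dependency-free sketch. This is not the paper's formulation: the abstract does not give the loss functions, so the choices here are assumptions made purely for illustration. The alignment term is a standard symmetric InfoNCE loss, and both counterfactual terms are hinge (margin) losses, with a larger margin for minimally edited captions. All names (`alignment_loss`, `hinge_loss`, `combined_loss`), the weights, the margins, and the temperature are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def alignment_loss(img_embs, txt_embs, temperature=0.07):
    """Assumed objective 1: symmetric InfoNCE over matched image/text pairs."""
    n = len(img_embs)
    sims = [[cosine(i, t) / temperature for t in txt_embs] for i in img_embs]
    loss = 0.0
    for k in range(n):
        row = sims[k]                          # image k against all texts
        col = [sims[j][k] for j in range(n)]   # text k against all images
        loss += logsumexp(row) - row[k]
        loss += logsumexp(col) - col[k]
    return loss / (2 * n)


def hinge_loss(img_emb, fact_txt, counterfactual_txts, margin):
    """Assumed objectives 2 and 3: the factual caption should score higher
    than every counterfactual caption by at least `margin`."""
    pos = cosine(img_emb, fact_txt)
    penalties = [max(0.0, margin - pos + cosine(img_emb, c))
                 for c in counterfactual_txts]
    return sum(penalties) / len(penalties)


def combined_loss(img_embs, txt_embs, scene_cf_txts, edit_cf_txts,
                  w_align=1.0, w_scene=1.0, w_edit=1.0,
                  scene_margin=0.2, edit_margin=0.4):
    """Weighted sum of the three illustrative terms.

    scene_cf_txts[k]: embeddings of coherent counterfactual captions for image k.
    edit_cf_txts[k]:  embeddings of minimally edited captions for image k.
    Weights and margins are hypothetical hyperparameters.
    """
    n = len(img_embs)
    align = alignment_loss(img_embs, txt_embs)
    scene = sum(hinge_loss(img_embs[k], txt_embs[k], scene_cf_txts[k], scene_margin)
                for k in range(n)) / n
    edit = sum(hinge_loss(img_embs[k], txt_embs[k], edit_cf_txts[k], edit_margin)
               for k in range(n)) / n
    return w_align * align + w_scene * scene + w_edit * edit
```

In this sketch a factual caption that already beats every counterfactual by the margin contributes zero to the hinge terms, so only pairs the model still confuses produce a penalty. The larger `edit_margin` is one way to express extra sensitivity to small but causally decisive edits; whether CF-VLM does this is not stated in the abstract.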
