Beyond Accuracy: On the Effects of Fine-tuning Towards Vision-Language Model's Prediction Rationality
Qitong Wang, Tang Li, Kien X. Nguyen, Xi Peng

TL;DR
This paper investigates how fine-tuning vision-language models affects their prediction rationality. It finds that fine-tuned models are more likely to predict correctly once they have identified valid evidence, but that fine-tuning also yields more correct predictions that rest on invalid evidence, which undermines the trustworthiness of those predictions.
Contribution
The study introduces two new metrics for assessing prediction rationality, Prediction Trustworthiness and Inference Reliability, and provides extensive experimental analysis of how fine-tuning affects VLMs, motivated by their use in safety-critical domains.
Findings
Fine-tuning can lead to more correct predictions based on invalid evidence.
Fine-tuned VLMs are more likely to make correct predictions when valid evidence is present.
Results are consistent across different settings and distributional shifts.
Abstract
Vision-Language Models (VLMs), such as CLIP, have already seen widespread applications. Researchers actively engage in further fine-tuning VLMs in safety-critical domains. In these domains, prediction rationality is crucial: the prediction should be correct and based on valid evidence. Yet, for VLMs, the impact of fine-tuning on prediction rationality is seldom investigated. To study this problem, we proposed two new metrics called Prediction Trustworthiness and Inference Reliability. We conducted extensive experiments on various settings and observed some interesting phenomena. On the one hand, we found that the well-adopted fine-tuning methods led to more correct predictions based on invalid evidence. This potentially undermines the trustworthiness of correct predictions from fine-tuned VLMs. On the other hand, having identified valid evidence of target objects, fine-tuned VLMs were…
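The abstract names the two metrics but this page does not give their formulas. As a rough illustration only, the sketch below assumes one plausible reading: Prediction Trustworthiness as the share of correct predictions that rest on valid evidence, and Inference Reliability as the share of valid-evidence cases that are predicted correctly. The function names, the per-sample boolean flags, and the toy data are all hypothetical; how "valid evidence" is decided for each sample is left out and assumed to be given.

```python
from typing import Sequence


def prediction_trustworthiness(correct: Sequence[bool], valid: Sequence[bool]) -> float:
    """Assumed reading: fraction of correct predictions that use valid evidence."""
    n_correct = sum(correct)
    if n_correct == 0:
        return float("nan")  # undefined when nothing is predicted correctly
    both = sum(c and v for c, v in zip(correct, valid))
    return both / n_correct


def inference_reliability(correct: Sequence[bool], valid: Sequence[bool]) -> float:
    """Assumed reading: fraction of valid-evidence cases predicted correctly."""
    n_valid = sum(valid)
    if n_valid == 0:
        return float("nan")  # undefined when no sample has valid evidence
    both = sum(c and v for c, v in zip(correct, valid))
    return both / n_valid


# Hypothetical per-sample flags for five test images (not data from the paper).
correct = [True, True, True, True, False]   # was the predicted label right?
valid = [True, False, True, False, True]    # did the model attend to valid evidence?

pt = prediction_trustworthiness(correct, valid)
ir = inference_reliability(correct, valid)
print(f"PT = {pt:.3f}, IR = {ir:.3f}")
```

Under this reading, a model can gain accuracy while its Prediction Trustworthiness falls, if the additional correct predictions come from cases with invalid evidence; that is the pattern the abstract reports for fine-tuned VLMs.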
Taxonomy
Topics: Geographic Information Systems Studies · Natural Language Processing Techniques · Speech and dialogue systems
Methods: Contrastive Language-Image Pre-training
