Calibrated Self-Rewarding Vision Language Models
Yiyang Zhou, Zhiyuan Fan, Dongjie Cheng, Sihan Yang, Zhaorun Chen, Chenhang Cui, Xiyao Wang, Yun Li, Linjun Zhang, Huaxiu Yao

TL;DR
This paper introduces Calibrated Self-Rewarding (CSR), a novel method for improving vision-language models by self-generating and evaluating responses with visual constraints, significantly reducing hallucinations and enhancing alignment.
Contribution
The paper proposes CSR, a self-improving approach that incorporates visual constraints into reward modeling, enabling models to iteratively enhance performance without external preference data.
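One way to picture the reward step described above is the minimal Python sketch below. It is an illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the weight `lam`, and the linear blend of a self-generated score with a visual relevance score are all hypothetical choices made here to show how a visual constraint could enter the reward and how the best and worst self-generated responses could form a preference pair.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One self-generated response with two hypothetical scores."""
    text: str
    self_score: float    # assumed: the model's own score for its response
    visual_score: float  # assumed: image-response relevance score


def calibrated_reward(c: Candidate, lam: float = 0.5) -> float:
    """Blend the visual score into the self-generated score.

    `lam` is an illustrative weight, not a value taken from the paper.
    """
    return lam * c.visual_score + (1.0 - lam) * c.self_score


def build_preference_pair(
    candidates: List[Candidate], lam: float = 0.5
) -> Tuple[Candidate, Candidate]:
    """Return (preferred, dispreferred) by calibrated reward."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: calibrated_reward(c, lam))
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    pool = [
        Candidate("fluent but ignores the image", 0.9, 0.1),
        Candidate("grounded in the image", 0.6, 0.9),
        Candidate("partly grounded", 0.5, 0.5),
    ]
    chosen, rejected = build_preference_pair(pool)
    print("preferred:", chosen.text)
    print("dispreferred:", rejected.text)
```

In an iterative setup, pairs built this way would feed a preference-optimization step, after which the updated model generates the next round of candidates. No external preference data enters the loop, which matches the contribution stated above.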
Findings
CSR improves performance across ten benchmarks and tasks, by 7.62% over existing methods.
It reduces hallucinations and enhances modality alignment.
The method is compatible with various vision-language models.
Abstract
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these…
Taxonomy
Topics: Language, Metaphor, and Cognition · Categorization, perception, and language
