Calibrated Self-Rewarding Vision Language Models

Yiyang Zhou; Zhiyuan Fan; Dongjie Cheng; Sihan Yang; Zhaorun Chen; Chenhang Cui; Xiyao Wang; Yun Li; Linjun Zhang; Huaxiu Yao

arXiv:2405.14622 · cs.LG · November 5, 2024 · 1 citation

TL;DR

This paper introduces Calibrated Self-Rewarding (CSR), a method in which a vision-language model generates its own candidate responses and scores them under visual constraints, reducing hallucinations and improving alignment between image and text.

Contribution

The paper proposes CSR, a self-improving approach that incorporates visual constraints into reward modeling, enabling models to iteratively enhance performance without external preference data.
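A minimal sketch of how a visually constrained reward could pick preference pairs from self-generated responses. This is an illustration under stated assumptions, not the authors' implementation: the weighted blend of the two scores, the `lam` weight, the `Candidate` fields, and both function names are hypothetical, and the scores are assumed to be precomputed and scaled to [0, 1].

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One self-generated response with two precomputed scores (both hypothetical)."""
    text: str
    text_score: float    # assumed: the model's own score for the response, in [0, 1]
    visual_score: float  # assumed: image-response relevance score, in [0, 1]


def calibrated_reward(candidate: Candidate, lam: float = 0.9) -> float:
    """Blend the visual and textual scores; `lam` weights the visual term (assumption)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * candidate.visual_score + (1.0 - lam) * candidate.text_score


def build_preference_pair(candidates: list[Candidate], lam: float = 0.9):
    """Return (chosen, rejected): the highest- and lowest-reward candidates."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: calibrated_reward(c, lam))
    return ranked[-1], ranked[0]


# Usage: a fluent but poorly grounded response versus a better grounded one.
fluent = Candidate("A dog is playing with a red ball.", text_score=0.9, visual_score=0.2)
grounded = Candidate("A cat is sitting on a sofa.", text_score=0.6, visual_score=0.8)
chosen, rejected = build_preference_pair([fluent, grounded], lam=0.5)
print(chosen.text, "|", rejected.text)
```

With the visual term weighted in, the better grounded response is chosen even though the model scores the other one higher on text alone. Pairs built this way could then feed a preference-optimization step, which is why no external preference data is needed.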

Findings

01. CSR improves performance across ten benchmarks by 7.62%.

02. CSR reduces hallucinations and enhances modality alignment.

03. The method is compatible with various vision-language models.

Abstract

Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning. Despite these advancements, LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image, indicating a misalignment between image and text pairs. This misalignment arises because the model tends to prioritize textual information over visual input, even when both the language model and visual representations are of high quality. Existing methods leverage additional models or human annotations to curate preference data and enhance modality alignment through preference optimization. These approaches may not effectively reflect the target LVLM's preferences, making the curated preferences easily distinguishable. Our work addresses these…

Code & Models

Repositories

yiyangzhou/csr (PyTorch, official)

Videos

Calibrated Self-Rewarding Vision Language Models · SlidesLive

Taxonomy

Topics: Language, Metaphor, and Cognition · Categorization, perception, and language