VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment

Lei Li; Zhihui Xie; Mukai Li; Shunian Chen; Peiyi Wang; Liang Chen; Yazheng Yang; Benyou Wang; Lingpeng Kong; Qi Liu

arXiv:2410.09421·cs.CV·October 21, 2024

TL;DR

This paper introduces VLFeedback, a large-scale vision-language feedback dataset annotated entirely by AI models, and shows that fine-tuning a large vision-language model (LVLM) on this data improves its helpfulness, visual faithfulness, and safety.

Contribution

The paper presents VLFeedback, the first large-scale vision-language feedback dataset for LVLMs, and shows that AI feedback can improve model alignment and safety without human annotations.

Findings

01

Silkie outperforms its base model on perception and cognition tasks.

02

Silkie hallucinates less on MMHal-Bench.

03

Silkie is more resilient to red-teaming attacks.

Abstract

As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and…
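
The abstract names direct preference optimization (DPO) as the fine-tuning objective. As a rough illustration only, the standard per-pair DPO loss can be sketched as below. This is not the authors' training code: the function name, the argument names, and the default `beta=0.1` are assumptions made for this example, and the inputs are taken to be summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All four inputs are sequence log-probabilities (floats).
    beta=0.1 is an assumed value, not one taken from the paper.
    """
    # How much more (in log space) the policy likes each response
    # than the frozen reference model does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin: positive when the policy favors the
    # chosen response more strongly than the reference does.
    margin = beta * (chosen_logratio - rejected_logratio)

    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy and the reference agree on both responses, the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen response relative to the reference.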

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Balanced Selection · ALIGN