VLFeedback: A Large-Scale AI Feedback Dataset for Large Vision-Language Models Alignment
Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, Lingpeng Kong, Qi Liu

TL;DR
This paper introduces VLFeedback, a large-scale vision-language feedback dataset annotated by AI models rather than humans, and shows that fine-tuning large vision-language models (LVLMs) on it improves alignment, safety, and robustness.
Contribution
The paper presents VLFeedback, the first large-scale AI feedback dataset for LVLMs, and shows its effectiveness in enhancing model alignment and safety without human supervision.
Findings
Silkie, the LVLM fine-tuned on VLFeedback, outperforms its base model on perception and cognition tasks.
Silkie hallucinates less on MMHal-Bench.
Silkie is more resilient to red-teaming attacks.
Abstract
As large vision-language models (LVLMs) evolve rapidly, the demand for high-quality and diverse data to align these models becomes increasingly crucial. However, the creation of such data with human supervision proves costly and time-intensive. In this paper, we investigate the efficacy of AI feedback to scale supervision for aligning LVLMs. We introduce VLFeedback, the first large-scale vision-language feedback dataset, comprising over 82K multi-modal instructions and comprehensive rationales generated by off-the-shelf models without human annotations. To evaluate the effectiveness of AI feedback for vision-language alignment, we train Silkie, an LVLM fine-tuned via direct preference optimization on VLFeedback. Silkie showcases exceptional performance regarding helpfulness, visual faithfulness, and safety metrics. It outperforms its base model by 6.9% and 9.5% in perception and…
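For readers unfamiliar with direct preference optimization (DPO), the objective the abstract mentions can be sketched for a single preference pair as follows. This is a minimal illustration of the standard DPO loss, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions made for the example, and the inputs are taken to be summed log-probabilities of the preferred and rejected responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    All names and the default beta are illustrative assumptions.
    Each *_logp is the summed log-probability of a full response.
    """
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin; positive when the policy favors the
    # chosen response more strongly than the reference does.
    logits = beta * (chosen_margin - rejected_margin)

    # loss = -log(sigmoid(logits)), written in a numerically stable form.
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy and the reference agree exactly, the loss equals log 2; it falls below that as the policy shifts probability toward the preferred response relative to the reference.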
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Balanced Selection · ALIGN
