FGAIF: Aligning Large Vision-Language Models with Fine-grained AI Feedback
Liqiang Jing, Xinya Du

TL;DR
This paper introduces FGAIF, a novel approach that uses fine-grained AI feedback to improve alignment in large vision-language models, reducing hallucinations and enhancing performance with fewer parameters.
Contribution
The paper proposes a new fine-grained AI feedback method for aligning LVLMs, addressing limitations of existing RL-based approaches by providing detailed feedback and dense rewards.
Findings
Significantly reduces hallucination issues in LVLMs.
Improves performance on visual-language benchmarks.
Achieves better results with fewer model parameters.
Abstract
Large Vision-Language Models (LVLMs) have demonstrated proficiency in tackling a variety of visual-language tasks. However, current LVLMs suffer from misalignment between text and image modalities, which causes three kinds of hallucination problems, i.e., object existence, object attribute, and object relationship. To tackle this issue, existing methods mainly utilize Reinforcement Learning (RL) to align modalities in LVLMs. However, they still suffer from three main limitations: (1) General feedback cannot indicate the hallucination type contained in the response; (2) Sparse rewards only give the sequence-level reward for the whole response; and (3) Annotation is time-consuming and labor-intensive. To handle these limitations, we propose an innovative method to align modalities in LVLMs through Fine-Grained Artificial Intelligence Feedback (FGAIF), which mainly consists of three…
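The contrast the abstract draws between a single sequence-level reward and fine-grained, type-specific feedback can be illustrated with a minimal Python sketch. This is not the paper's implementation: the `Segment` class, the per-segment hallucination flags, and the +1/-1 reward values are all hypothetical choices made for illustration, and the flags are assumed to be supplied by some external labeler.

```python
from dataclasses import dataclass

# The three hallucination types named in the abstract.
HALLUCINATION_TYPES = ("existence", "attribute", "relationship")


@dataclass
class Segment:
    """One piece of a model response (hypothetical structure)."""
    text: str
    # Hypothetical labels: hallucination types flagged for this segment.
    flagged: frozenset = frozenset()


def sparse_reward(segments):
    """Sequence-level reward: one scalar for the whole response.

    Returns 1.0 only if no segment is flagged, otherwise 0.0.
    The scalar does not say where or what kind of error occurred.
    """
    return 0.0 if any(s.flagged for s in segments) else 1.0


def dense_rewards(segments):
    """Segment-level reward: one value per segment and per type.

    Assumed convention for this sketch: +1.0 if clean, -1.0 if flagged.
    """
    return [
        {t: (-1.0 if t in s.flagged else 1.0) for t in HALLUCINATION_TYPES}
        for s in segments
    ]


response = [
    Segment("A dog sits on a sofa."),
    Segment("The dog is red.", frozenset({"attribute"})),
    Segment("A cat is next to the dog.", frozenset({"existence", "relationship"})),
]

print(sparse_reward(response))
for segment, rewards in zip(response, dense_rewards(response)):
    print(segment.text, rewards)
```

In this toy setup the sparse reward collapses the whole response into one number, while the dense version keeps the location and type of each flagged error, which is the kind of signal limitations (1) and (2) describe as missing.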
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: ALIGN
