Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning

Fuxiao Liu; Kevin Lin; Linjie Li; Jianfeng Wang; Yaser Yacoob; Lijuan Wang

arXiv:2306.14565·cs.CV·March 21, 2024·24 cites

Open Access 4 Repos 1 Models 3 Datasets

TL;DR

This paper introduces a large diverse dataset and evaluation method to reduce hallucinations in large multi-modal models through robust instruction tuning, improving their factual consistency across vision-and-language tasks.

Contribution

The paper presents the first large-scale visual instruction dataset with positive and negative samples and a new evaluation approach to mitigate hallucinations in multi-modal models.

Findings

01. LMMs hallucinate significantly when given negative instructions.

02. Finetuning on LRV-Instruction reduces hallucination.

03. A balanced mix of positive and negative data improves robustness.

Abstract

Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating inconsistent descriptions with respect to the associated image and human instructions. This paper addresses this issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at three semantic levels: (i) Nonexistent Object Manipulation, (ii) Existent Object Manipulation and (iii) Knowledge Manipulation. To efficiently…
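As a rough illustration of the positive/negative design described in the abstract, the sketch below models instruction samples and tallies them by the three negative levels. The `InstructionSample` schema, the `summarize` helper, and the example rows are all hypothetical, written only for this illustration; they are not the released LRV-Instruction format or data.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NegativeLevel(Enum):
    """The three semantic levels of negative instructions named in the abstract."""
    NONEXISTENT_OBJECT = "Nonexistent Object Manipulation"
    EXISTENT_OBJECT = "Existent Object Manipulation"
    KNOWLEDGE = "Knowledge Manipulation"


@dataclass(frozen=True)
class InstructionSample:
    # Hypothetical schema for illustration; not the dataset's actual fields.
    image_id: str
    instruction: str
    answer: str
    negative_level: Optional[NegativeLevel] = None  # None marks a positive sample

    @property
    def is_negative(self) -> bool:
        return self.negative_level is not None


def summarize(samples):
    """Return (positive_count, negative_count, per-level Counter)."""
    per_level = Counter(s.negative_level for s in samples if s.is_negative)
    negatives = sum(per_level.values())
    return len(samples) - negatives, negatives, per_level


# Made-up rows, one positive pair and one negative per level.
samples = [
    InstructionSample("img_0", "Describe the dog in the image.",
                      "A brown dog lies on the grass."),
    InstructionSample("img_0", "Where is the dog?",
                      "The dog is on the grass."),
    InstructionSample("img_0", "What color is the cat next to the dog?",
                      "There is no cat in the image.",
                      NegativeLevel.NONEXISTENT_OBJECT),
    InstructionSample("img_0", "Why is the dog swimming?",
                      "The dog is not swimming; it lies on the grass.",
                      NegativeLevel.EXISTENT_OBJECT),
    InstructionSample("img_0", "Which famous racing dog is shown here?",
                      "The image does not identify the dog as a famous one.",
                      NegativeLevel.KNOWLEDGE),
]

positives, negatives, per_level = summarize(samples)
print(f"positive={positives} negative={negatives}")
for level, count in per_level.items():
    print(f"  {level.value}: {count}")
```

Tracking the level on each sample makes it straightforward to check how balanced a training mix is before finetuning, which is the property the third finding above points to.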

Code & Models

Models

🤗 czxlovesu/prismatic-vlms (model)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Focus