LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models

Han Qiu; Jiaxing Huang; Peng Gao; Qin Qi; Xiaoqin Zhang; Ling Shao; Shijian Lu

arXiv:2410.09962 · cs.CV · October 16, 2024

TL;DR

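LongHalQA is an LLM-free benchmark of 6K long, complex hallucination texts for evaluating multimodal large language models (MLLMs). It targets two limitations of earlier benchmarks: discriminative questions that are too simple to reflect real-world text, and generative evaluation that depends on computationally intensive, unstable LLM evaluators.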

Contribution

It presents a new benchmark with long, real-world aligned hallucination data and two unified tasks, enabling more reliable and efficient hallucination evaluation without LLM evaluators.

Findings

01. Recent MLLMs struggle with long, complex hallucinations.

02. The benchmark reveals new challenges in handling detailed textual data.

03. The proposed evaluation pipeline facilitates future benchmark development.

Abstract

Hallucination, a phenomenon where multimodal large language models (MLLMs) tend to generate textual responses that are plausible but unaligned with the image, has become a major hurdle in various MLLM-related applications. Several benchmarks have been created to gauge the hallucination levels of MLLMs, either by raising discriminative questions about the existence of objects or by introducing LLM evaluators to score the text generated by MLLMs. However, the discriminative data largely involve simple questions that are not aligned with real-world text, while the generative data involve LLM evaluators that are computationally intensive and unstable due to their inherent randomness. We propose LongHalQA, an LLM-free hallucination benchmark that comprises 6K long and complex hallucination texts. LongHalQA features GPT4V-generated hallucinatory data that are well aligned with…
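
The contrast the abstract draws between LLM-scored and LLM-free evaluation can be illustrated with a minimal sketch. It assumes each benchmark item carries a fixed ground-truth label (for example a yes/no answer to an object-existence question), so that scoring reduces to a deterministic string comparison with no evaluator model in the loop. The item layout and the names `LabeledItem`, `normalize`, and `exact_match_accuracy` are hypothetical and are not taken from the LongHalQA repository or paper.

```python
from dataclasses import dataclass


@dataclass
class LabeledItem:
    """Hypothetical benchmark item: a question with one fixed gold label."""
    question: str
    answer: str  # ground-truth label, e.g. "yes" or "no"


def normalize(text: str) -> str:
    """Lowercase, trim whitespace, and drop trailing '.' or '!' characters."""
    return text.strip().lower().rstrip(".!")


def exact_match_accuracy(items: list[LabeledItem], predictions: list[str]) -> float:
    """Fraction of predictions whose normalized text equals the gold label.

    The score is deterministic: the same predictions always give the same
    number, unlike a sampled LLM evaluator.
    """
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    if not items:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(item.answer)
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)


if __name__ == "__main__":
    items = [
        LabeledItem("Is there a dog in the image?", "yes"),
        LabeledItem("Is there a car in the image?", "no"),
    ]
    print(exact_match_accuracy(items, ["Yes.", "yes"]))
```

Scoring of this kind costs one string comparison per item, which is the efficiency argument for avoiding LLM evaluators. How LongHalQA formats its own tasks is not specified in the text above.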

Code & Models

Repositories

hanqiu-hq/longhalqa
PyTorch · Official

Taxonomy

Topics: Machine Learning in Healthcare