Beyond Facts: Evaluating Intent Hallucination in Large Language Models
Yijie Hao, Haofei Yu, Jiaxuan You

TL;DR
This paper introduces the concept of intent hallucination in large language models, presents a new benchmark, FAITHQA, to evaluate it, and proposes an automatic metric, CONSTRAINT SCORE, to detect such hallucinations.
Contribution
The paper defines intent hallucination, creates the FAITHQA benchmark to evaluate it, and develops the CONSTRAINT SCORE metric to detect it automatically, advancing both the understanding and the measurement of the problem.
Findings
Intent hallucination is common in state-of-the-art LLMs.
It mainly results from omission or misinterpretation of query parts.
CONSTRAINT SCORE aligns closely with human judgment in detection.
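The summary above does not say how CONSTRAINT SCORE is computed, so the following is not the paper's metric. It is only a minimal sketch of the general idea behind the findings: treat a query as a set of separate conditions and check which ones a response ignores. The names `Constraint`, `satisfied_fraction`, and `unmet`, the example query, and the string-based checks are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """One condition extracted from a query (hypothetical structure)."""
    description: str
    check: Callable[[str], bool]  # returns True if the response meets it


def satisfied_fraction(response: str, constraints: List[Constraint]) -> float:
    """Share of constraints the response satisfies; 1.0 if there are none."""
    if not constraints:
        return 1.0
    met = sum(1 for c in constraints if c.check(response))
    return met / len(constraints)


def unmet(response: str, constraints: List[Constraint]) -> List[str]:
    """Descriptions of the constraints the response fails to satisfy."""
    return [c.description for c in constraints if not c.check(response)]


def _items(response: str) -> List[str]:
    # Split a comma-separated answer into trimmed items.
    return [part.strip() for part in response.split(",")]


# Illustrative query (an assumption, not taken from FAITHQA):
# "Name three European capitals, in alphabetical order, without mentioning Paris."
constraints = [
    Constraint("lists exactly three items", lambda r: len(_items(r)) == 3),
    Constraint("items are in alphabetical order", lambda r: _items(r) == sorted(_items(r))),
    Constraint("does not mention Paris", lambda r: "paris" not in r.lower()),
]

response = "Berlin, Paris, Rome"
score = satisfied_fraction(response, constraints)
missing = unmet(response, constraints)
```

Here the response meets the count and ordering conditions but ignores the exclusion, so two of the three conditions are satisfied and `missing` names the one that was dropped. A realistic evaluator would need a far more robust way to extract and check conditions than these string tests.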
Abstract
When exposed to complex queries containing multiple conditions, today's large language models (LLMs) tend to produce responses that only partially satisfy the query while neglecting certain conditions. We therefore introduce the concept of Intent Hallucination. In this phenomenon, LLMs either omit (neglecting to address certain parts) or misinterpret (responding to invented query parts) elements of the given query, leading to intent hallucinated generation. To systematically evaluate intent hallucination, we introduce FAITHQA, a novel benchmark for intent hallucination that contains 20,068 problems, covering both query-only and retrieval-augmented generation (RAG) setups with varying topics and difficulty. FAITHQA is the first hallucination benchmark that goes beyond factual verification, tailored to identify the fundamental cause of intent hallucination. By evaluating various LLMs on…
Taxonomy
Topics: Topic Modeling · Advanced Graph Neural Networks · Multimodal Machine Learning Applications
