Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem
Yuhong Sun, Zhangyue Yin, Qipeng Guo, Jiawen Wu, Xipeng Qiu, Hui Zhao

TL;DR
This paper introduces a benchmark and dataset for evaluating hallucination in large language models using unanswerable math word problems, and reports that in-context learning and RLHF training reduce hallucination.
Contribution
It presents the UMWP dataset and a novel evaluation methodology to assess LLM hallucination in math QA, highlighting the impact of in-context learning and RLHF.
Findings
In-context learning reduces hallucination.
RLHF training improves model reliability.
UMWP is effective for hallucination assessment.
Abstract
Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, they are prone to producing unreliable conjectures in ambiguous contexts, a phenomenon known as hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), which comprises 5200 questions across five categories. We also develop an evaluation methodology that combines text similarity and mathematical expression detection to determine whether an LLM considers a question unanswerable. The results of extensive experiments conducted on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training…
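The abstract describes the evaluation only at a high level, so the following is a minimal sketch of how text similarity and mathematical expression detection might be combined, not the authors' implementation. The template phrases, the regular expression, the 0.6 threshold, the function names, and the rule "similar to a refusal template and free of arithmetic" are all assumptions made for illustration. The standard-library `difflib.SequenceMatcher` stands in for whatever similarity model the paper actually uses.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference phrases signalling "unanswerable" (not taken from the paper).
UNANSWERABLE_TEMPLATES = [
    "the question cannot be answered",
    "there is not enough information to answer",
    "the problem is unanswerable",
]

# Assumed pattern for a mathematical expression: two numbers joined by an
# arithmetic operator or an equals sign, e.g. "3 + 4" or "10 = 10".
MATH_EXPR = re.compile(r"\d+(?:\.\d+)?\s*[-+*/=×÷]\s*\d+(?:\.\d+)?")


def max_similarity(sentence, templates):
    """Highest character-level similarity between a sentence and any template."""
    lowered = sentence.lower()
    return max(SequenceMatcher(None, lowered, t).ratio() for t in templates)


def judged_unanswerable(response, threshold=0.6):
    """Return True if the response appears to treat the question as unanswerable.

    Assumed decision rule: at least one sentence is similar to a refusal
    template AND the response contains no arithmetic expression.
    """
    sentences = [s.strip() for s in re.split(r"[.!?\n]+", response) if s.strip()]
    similar = any(
        max_similarity(s, UNANSWERABLE_TEMPLATES) >= threshold for s in sentences
    )
    has_math = bool(MATH_EXPR.search(response))
    return similar and not has_math


if __name__ == "__main__":
    print(judged_unanswerable("There is not enough information to answer this question."))
    print(judged_unanswerable("The total is 3 + 4 = 7 apples."))
```

Under this assumed rule, a response that refuses but still computes a number is counted as an attempted answer, which is one conservative way to flag hallucination on an unanswerable question.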
