Benchmarking Hallucination in Large Language Models based on Unanswerable Math Word Problem

Yuhong Sun; Zhangyue Yin; Qipeng Guo; Jiawen Wu; Xipeng Qiu; Hui Zhao

arXiv:2403.03558 · cs.CL · March 7, 2024 · 1 citation

TL;DR

This paper introduces a new benchmark and dataset for evaluating hallucination in large language models using unanswerable math word problems, and reports that in-context learning and RLHF training reduce hallucination.

Contribution

It presents the UMWP dataset and a novel evaluation methodology to assess LLM hallucination in math QA, highlighting the impact of in-context learning and RLHF.

Findings

01

In-context learning reduces hallucination.

02

RLHF training improves model reliability.

03

UMWP is effective for hallucination assessment.

Abstract

Large language models (LLMs) are highly effective in various natural language processing (NLP) tasks. However, in ambiguous contexts they are susceptible to producing unreliable conjectures, a phenomenon called hallucination. This paper presents a new method for evaluating LLM hallucination in Question Answering (QA) based on unanswerable math word problems (MWPs). To support this approach, we develop a dataset called Unanswerable Math Word Problem (UMWP), which comprises 5200 questions across five categories. We develop an evaluation methodology combining text similarity and mathematical expression detection to determine whether an LLM considers the question unanswerable. The results of extensive experiments conducted on 31 LLMs, including GPT-3, InstructGPT, LLaMA, and Claude, demonstrate that in-context learning and reinforcement learning with human feedback (RLHF) training…
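The evaluation idea described in the abstract, text similarity combined with mathematical expression detection, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper or its repository: the refusal templates, the use of `difflib.SequenceMatcher` as a stand-in similarity measure, the equation regex, the 0.7 threshold, and the rule that combines the two signals.

```python
import re
from difflib import SequenceMatcher

# Hypothetical refusal templates (illustrative only, not from the paper).
REFUSAL_TEMPLATES = [
    "the question cannot be answered",
    "there is not enough information to solve the problem",
    "the answer is unknown",
]

# Matches a simple arithmetic equation with a numeric result, e.g. "3 + 4 = 7".
EQUATION_RE = re.compile(
    r"\d+(?:\.\d+)?(?:\s*[-+*/]\s*\d+(?:\.\d+)?)+\s*=\s*\d+(?:\.\d+)?"
)


def max_template_similarity(response: str) -> float:
    """Highest character-level similarity between any sentence and any template."""
    sentences = [
        s.strip().lower() for s in re.split(r"[.!?\n]+", response) if s.strip()
    ]
    return max(
        (
            SequenceMatcher(None, sentence, template).ratio()
            for sentence in sentences
            for template in REFUSAL_TEMPLATES
        ),
        default=0.0,
    )


def has_math_expression(response: str) -> bool:
    """True if the response contains an arithmetic equation with a numeric result."""
    return EQUATION_RE.search(response) is not None


def considers_unanswerable(response: str, threshold: float = 0.7) -> bool:
    """Assumed combination rule: refusal-like wording and no computed equation."""
    return (
        max_template_similarity(response) >= threshold
        and not has_math_expression(response)
    )
```

Under these assumptions, a response that resembles a refusal template and computes no equation is counted as treating the question as unanswerable, while a response that commits to a worked equation is counted as an attempted answer. An embedding-based similarity measure could replace `SequenceMatcher` without changing the overall structure.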

Code & Models

Repositories

yuki-asuuna/umwp
PyTorch · Official

Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Big Data and Digital Economy

Methods: Attention Is All You Need · Linear Layer · Cosine Annealing · Byte Pair Encoding · Multi-Head Attention · Linear Warmup With Cosine Annealing · Layer Normalization