Bi'an: A Bilingual Benchmark and Model for Hallucination Detection in Retrieval-Augmented Generation

Zhouyu Jiang; Mengshu Sun; Zhiqiang Zhang; Lei Liang

arXiv:2502.19209·cs.CL·February 27, 2025

TL;DR

This paper introduces Bi'an, a bilingual benchmark and lightweight judge models for detecting hallucinations in Retrieval-Augmented Generation, improving evaluation and detection accuracy.

Contribution

It presents a novel bilingual benchmark dataset and fine-tuned lightweight judge models for better hallucination detection in RAG systems.

Findings

01. 14B model outperforms larger baselines
02. Rivals state-of-the-art closed-source LLMs
03. Extensive evaluation on Bi'anBench

Abstract

Retrieval-Augmented Generation (RAG) effectively reduces hallucinations in Large Language Models (LLMs) but can still produce inconsistent or unsupported content. Although LLM-as-a-Judge is widely used for RAG hallucination detection because it is simple to implement, it faces two main challenges: the absence of comprehensive evaluation benchmarks and the lack of domain-optimized judge models. To bridge these gaps, we introduce Bi'an, a novel framework featuring a bilingual benchmark dataset and lightweight judge models. The dataset supports rigorous evaluation across multiple RAG scenarios, while the judge models are fine-tuned from compact open-source LLMs. Extensive experimental evaluations on Bi'anBench show that our 14B model outperforms baseline models with more than five times as many parameters and rivals state-of-the-art closed-source LLMs. We will release our data and…

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Misinformation and Its Impacts