ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models

Jio Oh; Soyeon Kim; Junseok Seo; Jindong Wang; Ruochen Xu; Xing Xie; Steven Euijong Whang

arXiv:2403.05266·cs.CL·November 5, 2024·1 cites

TL;DR

ERBench leverages entity-relationship databases with integrity constraints to create a dynamic, complex, and verifiable benchmark for evaluating large language models' reasoning and answer accuracy.

Contribution

This paper introduces ERBench, a novel benchmark that uses relational databases with integrity constraints to evaluate LLMs' reasoning and answer verification capabilities.

Findings

01

ERBench effectively evaluates LLMs across multiple domains.

02

It verifies answers by checking for correct keywords and reasoning.

03

ERBench supports continuous evaluation, multimodal questions, and questions of arbitrary complexity.

Abstract

Large language models (LLMs) have achieved unprecedented performances in various applications, yet evaluating them is still challenging. Existing benchmarks are either manually constructed or are automatic, but lack the ability to evaluate the thought process of LLMs with arbitrary complexity. We contend that utilizing existing relational databases based on the entity-relationship (ER) model is a promising approach for constructing benchmarks as they contain structured knowledge that can be used to question LLMs. Unlike knowledge graphs, which are also used to evaluate LLMs, relational databases have integrity constraints that can be used to better construct complex in-depth questions and verify answers: (1) functional dependencies can be used to pinpoint critical keywords that an LLM must know to properly answer a given question containing certain attribute values; and (2) foreign key…
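As a rough illustration of point (1) in the abstract, the sketch below shows how a functional dependency could be turned into a question and an automatic keyword check. It is a minimal sketch under stated assumptions, not the authors' implementation: the `FunctionalDependency` class, the `build_question` and `verify_answer` helpers, the question template, and the one-row toy table are all hypothetical names and data introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionalDependency:
    """Hypothetical container for a dependency lhs -> rhs."""
    lhs: tuple  # determinant attributes, whose values go into the question
    rhs: str    # dependent attribute, whose value is the expected keyword


def build_question(row, fd):
    """Fill a simple (assumed) question template from the determinant values."""
    given = ", ".join(f"{attr} = {row[attr]}" for attr in fd.lhs)
    return f"Given {given}, what is the {fd.rhs}?"


def verify_answer(answer, row, fd):
    """Check that the answer mentions the value the dependency determines.

    Case-insensitive substring match; a real checker would likely need
    more careful normalization (aliases, punctuation, partial names).
    """
    return str(row[fd.rhs]).lower() in answer.lower()


# Toy table with one row; illustrative data only.
MOVIES = [
    {"title": "Inception", "year": 2010, "director": "Christopher Nolan"},
]

# Assumed dependency: (title, year) -> director
FD = FunctionalDependency(lhs=("title", "year"), rhs="director")

if __name__ == "__main__":
    row = MOVIES[0]
    print(build_question(row, FD))
    print(verify_answer("It was directed by Christopher Nolan.", row, FD))
```

Because the dependency fixes exactly one right-hand value for each left-hand combination, the expected keyword is unambiguous, which is what makes this kind of check automatable.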

Code & Models

Repositories

dilab-kaist/erbench
Official

Videos

ERBench: An Entity-Relationship based Automatically Verifiable Hallucination Benchmark for Large Language Models · slideslive

Taxonomy

Topics: Machine Learning in Healthcare

Methods: Attention Is All You Need · Linear Layer · Dropout · Multi-Head Attention · Position-Wise Feed-Forward Layer · Layer Normalization · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing