ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

Taewhoo Lee; Chanwoong Yoon; Kyochul Jang; Donghyeon Lee; Minju Song; Hyunjae Kim; Jaewoo Kang

arXiv:2410.16848 · cs.CL · February 28, 2025

TL;DR

ETHIC is a new benchmark that evaluates whether large language models use the full extent of long input contexts. It reveals significant performance gaps in current models across diverse tasks with high information coverage.

Contribution

The paper introduces the information coverage metric and the ETHIC benchmark, which together provide a more effective evaluation of LLMs' long-context understanding capabilities.

Findings

01. Current benchmarks have low information coverage.

02. LLMs show significant performance drops on ETHIC tasks.

03. ETHIC covers diverse domains such as books, debates, medicine, and law.

Abstract

Recent advancements in large language models (LLMs) capable of processing extremely long texts highlight the need for a dedicated evaluation benchmark to assess their long-context capabilities. However, existing methods, like the needle-in-a-haystack test, do not effectively assess whether these models fully utilize contextual information, raising concerns about the reliability of current evaluation techniques. To thoroughly examine the effectiveness of existing benchmarks, we introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. Our findings indicate that current benchmarks exhibit low IC; although the input context may be extensive, the actual usable context is often limited. To address this, we present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context. Our…
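
The abstract defines information coverage (IC) as the proportion of the input context necessary for answering a query. The following is a minimal sketch of that ratio, not the authors' implementation: it assumes the context is already tokenized and that the needed parts are given as half-open token-index spans. The function name `information_coverage` and the span representation are hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def information_coverage(
    context_tokens: Sequence[str],
    needed_spans: List[Tuple[int, int]],
) -> float:
    """Return the fraction of context tokens inside at least one needed span.

    Assumption (not from the paper): needed_spans holds half-open
    (start, end) token-index ranges marking the text required to answer.
    """
    total = len(context_tokens)
    if total == 0:
        raise ValueError("context must not be empty")

    needed = set()
    for start, end in needed_spans:
        # Clip each span to the context bounds; overlapping spans are
        # counted once because indices are collected in a set.
        needed.update(range(max(start, 0), min(end, total)))

    return len(needed) / total


context = "a b c d e f g h i j".split()

# A needle-style query that depends on a single token has low IC.
needle_ic = information_coverage(context, [(5, 6)])

# A query that depends on the whole context has IC equal to 1.
full_ic = information_coverage(context, [(0, len(context))])

# Overlapping spans (0-2 and 1-4) cover token indices 0..3.
overlap_ic = information_coverage(context, [(0, 2), (1, 4)])
```

Under these assumptions, a needle-in-a-haystack query yields an IC close to 0 for long inputs, while a task that requires every part of the input yields an IC of 1, which is the regime the abstract says ETHIC targets.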

Code & Models

Repositories

dmis-lab/ethic
PyTorch · Official

Datasets

dmis-lab/ETHIC
dataset · 18 dl

Videos

ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques