LongGenBench: Long-context Generation Benchmark

Xiang Liu; Peijie Dong; Xuming Hu; Xiaowen Chu

arXiv:2410.04199 · cs.CL · October 25, 2024

TL;DR

LongGenBench is a new benchmark designed to evaluate the ability of large language models to generate coherent, long-context responses, revealing performance degradation across different models and configurations.

Contribution

This paper introduces LongGenBench, a synthetic benchmark specifically for assessing long-context generation capabilities of LLMs, filling a gap in existing evaluation tools.

Findings

01. The evaluated models show performance degradation ranging from 1.2% to 47.1% in long-context generation.

02. Gemini-1.5-Flash exhibits the least degradation among API models.

03. The Qwen2 series shows the least degradation among open-source models.
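The findings above compare models by how much their scores drop under long-context generation. This page does not state how that degradation is computed, so the following is only a minimal sketch under one plausible reading: the absolute drop in accuracy, in percentage points, between a baseline run and a long-generation run. The function name, the model names, and all scores are hypothetical and are not taken from the paper.

```python
def degradation(baseline_acc: float, longgen_acc: float) -> float:
    """Absolute accuracy drop in percentage points (assumed definition)."""
    return baseline_acc - longgen_acc


# Hypothetical (baseline, long-generation) accuracies; NOT results from the paper.
scores = {
    "model_a": (80.0, 78.5),
    "model_b": (70.0, 40.0),
}

# Per-model drop, then the model that degrades least.
drops = {name: degradation(base, long) for name, (base, long) in scores.items()}
least_degraded = min(drops, key=drops.get)
```

Under this reading, a ranking such as "least degradation among API models" amounts to taking the minimum of `drops` over the models in that group.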

Abstract

Current long-context benchmarks primarily focus on retrieval-based tests, requiring Large Language Models (LLMs) to locate specific information within extensive input contexts, such as the needle-in-a-haystack (NIAH) benchmark. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans across lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and necessitating that…

Code & Models

Repositories

dominic789654/longgenbench (Official)

Videos

LongGenBench: Long-context Generation Benchmark · underline

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Multimedia Communication and Technology

Methods: Focus