HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models

Haoran Que; Feiyu Duan; Liqun He; Yutao Mou; Wangchunshu Zhou; Jiaheng Liu; Wenge Rong; Zekun Moore Wang; Jian Yang; Ge Zhang; Junran Peng; Zhaoxiang Zhang; Songyang Zhang; Kai Chen

arXiv:2409.16191 · cs.CL · September 25, 2024 · 2 citations

TL;DR

HelloBench is a comprehensive benchmark for evaluating how well large language models generate long texts across various tasks, a capability that existing benchmarks leave largely unexamined. The paper also proposes HelloEval, an efficient, human-aligned evaluation method.

Contribution

The paper introduces HelloBench, a benchmark for long text generation in LLMs, and HelloEval, an accompanying evaluation method, and reports extensive experiments on 30 models.

Findings

1. Most LLMs cannot generate texts longer than 4000 words.
2. Longer text generation often suffers from repetition and quality issues.
3. HelloEval correlates highly with human judgment, outperforming traditional metrics.

Abstract

In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted…

Code & Models

Repositories

quehry/hellobench
Official

Datasets

quehry/HelloBench
Dataset · 125 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification