HelloBench: Evaluating Long Text Generation Capabilities of Large Language Models
Haoran Que, Feiyu Duan, Liqun He, Yutao Mou, Wangchunshu Zhou, Jiaheng Liu, Wenge Rong, Zekun Moore Wang, Jian Yang, Ge Zhang, Junran Peng, Zhaoxiang Zhang, Songyang Zhang, Kai Chen

TL;DR
HelloBench is a comprehensive benchmark designed to evaluate large language models' ability to generate long texts across various tasks, addressing a gap in current assessments and proposing an efficient human-aligned evaluation method.
Contribution
The paper introduces HelloBench, a benchmark, and HelloEval, an evaluation method, both designed specifically for assessing long text generation in LLMs, and reports extensive experiments on 30 models.
Findings
Most LLMs cannot generate texts longer than 4000 words.
Longer text generation often suffers from repetition and quality issues.
HelloEval correlates highly with human judgment, outperforming traditional metrics.
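The first two findings concern output length and repetition, both of which can be checked roughly on any generated text. The following is a minimal sketch, not the paper's HelloEval method: the function names, the whitespace tokenization, the 4-gram window, and the use of 4000 words as a default target are all illustrative assumptions.

```python
from collections import Counter


def word_count(text: str) -> int:
    """Count whitespace-separated tokens as a rough proxy for words."""
    return len(text.split())


def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    # Each n-gram seen c times contributes c - 1 repeats.
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)


def check_output(text: str, target_words: int = 4000, n: int = 4) -> dict:
    """Summarize length and repetition for one generated text (hypothetical helper)."""
    words = word_count(text)
    return {
        "words": words,
        "meets_target": words >= target_words,
        "repeated_ngram_ratio": repeated_ngram_ratio(text, n),
    }
```

Calling `check_output(generated_text)` on a model response returns the word count, whether it reaches the target length, and the share of repeated 4-grams. A high ratio on a long output is a hint of the repetition problem noted above, though it says nothing about overall quality.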
Abstract
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks (e.g., long-context understanding), and many benchmarks have been proposed. However, we observe that long text generation capabilities are not well investigated. Therefore, we introduce the Hierarchical Long Text Generation Benchmark (HelloBench), a comprehensive, in-the-wild, and open-ended benchmark to evaluate LLMs' performance in generating long text. Based on Bloom's Taxonomy, HelloBench categorizes long text generation tasks into five subtasks: open-ended QA, summarization, chat, text completion, and heuristic text generation. Besides, we propose Hierarchical Long Text Evaluation (HelloEval), a human-aligned evaluation method that significantly reduces the time and effort required for human evaluation while maintaining a high correlation with human evaluation. We have conducted…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Text Readability and Simplification
