TL;DR
LIFEBench is a comprehensive benchmark that evaluates large language models' ability to follow explicit length instructions across diverse tasks and languages, revealing significant limitations, especially at longer target lengths.
Contribution
This paper introduces LIFEBench, the first extensive benchmark for assessing LLMs' performance on length instruction following across multiple tasks, lengths, and languages.
Findings
Most models follow short-length instructions well but struggle with longer lengths.
Models rarely achieve their claimed maximum output lengths in practice.
Reasoning LLMs outperform specialized long-text models in length following.
Abstract
While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Models often generate outputs that are far too short, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce the Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely used LLMs and find that most…
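As a rough illustration of what checking a length constraint can involve, the sketch below counts whitespace-separated words in a model output and reports the signed relative gap to the requested length. The function names, the whitespace-based word count, and the deviation formula are assumptions made for this example, not LIFEBench's actual metric or code. Whitespace splitting would also not work for Chinese text, which needs a different counting unit.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed proxy for word count)."""
    return len(text.split())


def length_deviation(output: str, target_words: int) -> float:
    """Signed relative gap between actual and requested length.

    Negative values mean the output is too short, positive too long.
    This formula is an illustrative assumption, not the benchmark's metric.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    return (count_words(output) - target_words) / target_words


if __name__ == "__main__":
    # Hypothetical model output for a request of 16 words.
    sample = "The quick brown fox jumps over the lazy dog"
    print(count_words(sample))
    print(length_deviation(sample, 16))
```

A signed ratio keeps under-generation and over-generation distinguishable, which matters here because the abstract reports that outputs are often far too short.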
