LIFEBench: Evaluating Length Instruction Following in Large Language Models

Wei Zhang; Zhenhong Zhou; Kun Wang; Junfeng Fang; Yuanhe Zhang; Rui Wang; Ge Zhang; Xavier Li; Li Sun; Lingjuan Lyu; Yang Liu; Sen Su

arXiv:2505.16234·cs.CL·June 12, 2025

TL;DR

LIFEBench is a comprehensive benchmark that evaluates how well large language models follow explicit length instructions across diverse tasks and languages. It reveals significant limitations, especially at longer requested lengths.

Contribution

This paper introduces LIFEBench, the first extensive benchmark for assessing LLMs' performance on length instruction following across multiple tasks, lengths, and languages.

Findings

01 Most models follow short-length instructions well but struggle with longer lengths.

02 Models rarely achieve their claimed maximum output lengths in practice.

03 Reasoning LLMs outperform specialized long-text models in length following.

Abstract

While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., "write a 10,000-word novel". Models often generate outputs that are far too short, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality and often overlook whether the generations meet length constraints. To this end, we introduce the Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most…
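To make the idea of checking a length constraint concrete, here is a minimal sketch of how an output could be compared against a requested word count. It is an illustration only: the abstract above does not state the benchmark's actual metrics, so the function names (`word_count`, `length_deviation`, `within_tolerance`), the relative-deviation formula, and the 10% tolerance are all assumptions rather than LIFEBench's method. Whitespace splitting is also only a rough proxy for English words and would not work for Chinese text.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough proxy for English words)."""
    return len(text.split())


def length_deviation(text: str, target_words: int) -> float:
    """Signed relative deviation of the output length from the target.

    Hypothetical metric for illustration, not taken from the paper.
    Negative values mean the output is too short, positive means too long.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    return (word_count(text) - target_words) / target_words


def within_tolerance(text: str, target_words: int, tolerance: float = 0.1) -> bool:
    """Return True if the output length is within +/- tolerance of the target.

    The default tolerance of 10% is an arbitrary assumption.
    """
    return abs(length_deviation(text, target_words)) <= tolerance


# A 12-word output against a 16-word instruction is 25% too short.
sample = "word " * 12
print(length_deviation(sample, 16))  # -0.25
print(within_tolerance(sample, 16))  # False
```

A signed deviation is used here so that under-generation and over-generation can be told apart, which matters given the abstract's observation that models often produce outputs that are too short.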

Code & Models

Repositories

lifebench/lifebench (Official)
