LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios
Xiaodong Wu, Minhao Wang, Yichen Liu, Xiaoming Shi, He Yan, Xiangju Lu, Junmin Zhu, Wei Zhang

TL;DR
LIFBench (a dataset) and LIFEval (an automated scoring method) evaluate how well, and how stably, large language models follow instructions in long-context scenarios, a gap that existing benchmarks leave largely unaddressed.
Contribution
The paper introduces LIFBench, a scalable dataset, and LIFEval, an automated assessment method, for evaluating LLMs' instruction-following and stability in long-context settings.
Findings
LIFBench covers three long-context scenarios and eleven tasks.
LIFEval provides automated, rubric-based scoring without human input.
Experiments on 20 LLMs reveal performance variations across context lengths.
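To make "automated, rubric-based scoring without human input" concrete, here is a minimal sketch of what such a scorer could look like. It is an illustration only, not LIFEval itself: the `RubricItem` class, the `score_response` function, the weights, and the example rubric are all hypothetical names and values chosen for this sketch.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One programmatically checkable requirement (hypothetical structure)."""
    name: str
    weight: float
    check: Callable[[str], bool]  # pure function of the response; no LLM or human judge


def score_response(response: str, rubric: List[RubricItem]) -> float:
    """Return the weighted fraction of rubric items that the response satisfies."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total if total else 0.0


def _lines(response: str) -> List[str]:
    return response.strip().splitlines()


# Hypothetical rubric for an instruction such as:
# "Return exactly three list items, each prefixed with '- ', with no apology."
rubric = [
    RubricItem("three_lines", 2.0, lambda r: len(_lines(r)) == 3),
    RubricItem("dash_prefix", 1.0, lambda r: all(l.startswith("- ") for l in _lines(r))),
    RubricItem("no_apology", 1.0, lambda r: re.search(r"\bsorry\b", r, re.I) is None),
]

good = "- alpha\n- beta\n- gamma"
bad = "Sorry, here you go:\n- alpha\n- beta"

good_score = score_response(good, rubric)  # satisfies every item
bad_score = score_response(bad, rubric)    # only the line-count item passes
```

Because every check is a deterministic function of the response text, the same response always receives the same score, which is the property that lets a rubric of this kind run without a judge model or a human rater.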
Abstract
As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This…
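The abstract names three expansion dimensions (length, expression, and variables) but does not spell out the procedure here. As a rough sketch, assuming the expansion can be modeled as a Cartesian product over those dimensions, it might look like the following. The function `expand_instructions`, the templates, the variable sets, and the context lengths are hypothetical and are not taken from the paper.

```python
from itertools import product
from typing import Dict, Iterator, List


def expand_instructions(
    templates: List[str],
    variable_sets: List[Dict[str, object]],
    context_lengths: List[int],
) -> Iterator[Dict[str, object]]:
    """Yield one record per (expression, variables, length) combination."""
    for template, variables, length in product(templates, variable_sets, context_lengths):
        yield {
            "context_tokens": length,                     # length dimension
            "instruction": template.format(**variables),  # expression + variables
        }


# Hypothetical values for illustration only.
templates = [
    "Return the entry at position {n} of the list above.",
    "From the list above, output item number {n}.",
]
variable_sets = [{"n": 3}, {"n": 7}]
context_lengths = [4_000, 8_000, 16_000]

instructions = list(expand_instructions(templates, variable_sets, context_lengths))
```

With two phrasings, two variable settings, and three lengths, this produces twelve instructions from a single seed task. Holding two dimensions fixed and varying the third is one way to probe how stable a model's behavior is along that dimension.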
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Focus
