LIFBench: Evaluating the Instruction Following Performance and Stability of Large Language Models in Long-Context Scenarios

Xiaodong Wu; Minhao Wang; Yichen Liu; Xiaoming Shi; He Yan; Xiangju Lu; Junmin Zhu; Wei Zhang

arXiv:2411.07037·cs.CL·July 25, 2025

TL;DR

LIFBench and LIFEval are new tools designed to evaluate large language models' ability to follow instructions and maintain stability in long-context scenarios, addressing a gap in existing benchmarks.

Contribution

The paper introduces LIFBench, a scalable dataset, and LIFEval, an automated assessment method, for evaluating LLMs' instruction-following and stability in long-context settings.

Findings

01. LIFBench covers three long-context scenarios and eleven tasks.

02. LIFEval provides automated, rubric-based scoring without human input.

03. Experiments on 20 LLMs reveal performance variations across context lengths.
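
To make "rubric-based scoring without human input" concrete, here is a minimal sketch of the general idea: each rubric item is a programmatic check with a weight, and a response's score is the weighted fraction of checks it passes. This is not LIFEval's actual implementation. The names `RubricItem`, `score_response`, and `score_spread`, the example rubric, and the use of a population standard deviation as a rough stability signal are all assumptions made for illustration.

```python
from dataclasses import dataclass
from statistics import pstdev
from typing import Callable, List


@dataclass
class RubricItem:
    """One programmatic check (hypothetical structure, not from the paper)."""
    name: str
    weight: float
    check: Callable[[str], bool]


def score_response(response: str, rubric: List[RubricItem]) -> float:
    """Return the weighted fraction of rubric checks the response passes."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(response))
    return earned / total


def score_spread(scores: List[float]) -> float:
    """Population standard deviation of scores; lower means more stable.

    Assumption: spread across inputs is used here as a stand-in for
    'stability'; the paper's own stability metric may differ.
    """
    return pstdev(scores)


# Illustrative rubric for an instruction such as
# "Reply on one line, starting with 'Answer:', in at most five words."
rubric = [
    RubricItem("is_single_line", 1.0, lambda r: "\n" not in r.strip()),
    RubricItem("starts_with_prefix", 1.0, lambda r: r.strip().startswith("Answer:")),
    RubricItem("at_most_five_words", 2.0, lambda r: len(r.split()) <= 5),
]

responses = [
    "Answer: item 42",
    "The answer is item 42, which appears in the list.",
]
scores = [score_response(r, rubric) for r in responses]
print(scores)                # [1.0, 0.25]
print(score_spread(scores))  # 0.375
```

Because every check is a deterministic function of the response text, the score needs neither a judge model nor a human rater, which is the property the finding above describes.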

Abstract

As Large Language Models (LLMs) evolve in natural language processing (NLP), their ability to stably follow instructions in long-context inputs has become critical for real-world applications. However, existing benchmarks seldom focus on instruction-following in long-context scenarios or stability on different inputs. To bridge this gap, we introduce LIFBench, a scalable dataset designed to evaluate LLMs' instruction-following capabilities and stability across long contexts. LIFBench comprises three long-context scenarios and eleven diverse tasks, featuring 2,766 instructions generated through an automated expansion method across three dimensions: length, expression, and variables. For evaluation, we propose LIFEval, a rubric-based assessment method that enables precise, automated scoring of complex LLM responses without reliance on LLM-assisted assessments or human judgment. This…

Code & Models

Repositories

sheldonwu0327/lif-bench-2024 (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Focus