One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework

Qi Jia; Ye Shen; Xiujie Song; Kaiwei Zhang; Shibo Wang; Dun Pei; Xiangyang Zhu; Guangtao Zhai

arXiv:2511.03508 · cs.CL · January 9, 2026

Open Access

TL;DR

This paper introduces EvolIF, a new benchmark and framework for evaluating large language models' multi-turn instruction-following abilities, emphasizing realistic user interactions and conversational depth.

Contribution

The paper presents an evolving benchmark, built on a framework that simulates sequential user behavior, and measures LLM performance over extended multi-turn dialogues.

Findings

1. GPT-5 shows the highest robustness at 66.40%.
2. Performance declines as conversation depth increases.
3. Existing models struggle with failure recovery and fine-grained instructions.

Abstract

Evaluating LLMs' instruction-following ability in multi-topic dialogues is essential yet challenging. Existing benchmarks are limited to a fixed number of turns, which makes them susceptible to saturation, and they fail to account for users' interactive experience. In this work, we propose a novel framework featuring a three-layer tracking mechanism and a query synthesis agent to mimic sequential user behaviors. Grounded in Flow Theory, we introduce process-centric metrics and terminate a conversational evaluation only upon exhausting user patience. Leveraging this framework, we present EvolIF, an evolving benchmark covering 12 constraint groups. Our analysis reveals deficiencies in failure recovery and fine-grained instruction following, with performance stratification becoming evident as conversational depth increases. GPT-5 demonstrates the most sustained resilience, maintaining a 66.40% robustness…

Taxonomy

Topics: Topic Modeling · Speech and dialogue systems · Natural Language Processing Techniques