SEQUOR: A Multi-Turn Benchmark for Realistic Constraint Following
Beatriz Canaverde, Duarte M. Alves, José Pombal, Giuseppe Attanasio, André F. T. Martins

TL;DR
SEQUOR is a benchmark for evaluating how well models adhere to constraints in long multi-turn conversations. It shows that instruction-following accuracy degrades as conversations grow longer.
Contribution
The paper presents SEQUOR, a new automatic benchmark for assessing constraint adherence in long-horizon multi-turn conversations, highlighting challenges faced by current models.
Findings
Even with a single constraint, instruction-following accuracy drops by more than 11% as conversations grow longer.
Accuracy drops by more than 40% when models must follow multiple constraints simultaneously.
Accuracy declines by more than 9% when constraints are added or replaced.
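To make the kind of measurement behind these figures concrete, here is a minimal sketch of how per-turn accuracy and its drop over a conversation could be computed. It is not the authors' evaluation code. The record format, the function names `accuracy_by_turn` and `accuracy_drop`, and the toy data are all assumptions made for illustration, and the sketch reports an absolute drop in percentage points, which may differ from how the paper defines its percentages.

```python
from collections import defaultdict


def accuracy_by_turn(records):
    """Compute the fraction of constraint-following responses at each turn.

    `records` is a hypothetical format: an iterable of (turn_index, followed)
    pairs, where `followed` is True if the response at that turn satisfied
    the active constraint.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for turn, followed in records:
        totals[turn] += 1
        hits[turn] += int(followed)
    return {turn: hits[turn] / totals[turn] for turn in sorted(totals)}


def accuracy_drop(per_turn):
    """Absolute drop, in percentage points, from the first to the last turn."""
    turns = sorted(per_turn)
    return 100.0 * (per_turn[turns[0]] - per_turn[turns[-1]])


# Toy, made-up data: four simulated conversations, three turns each.
records = [
    (1, True), (1, True), (1, True), (1, True),
    (2, True), (2, True), (2, True), (2, False),
    (3, True), (3, True), (3, False), (3, False),
]

per_turn = accuracy_by_turn(records)
drop = accuracy_drop(per_turn)
print(per_turn)
print(f"drop from first to last turn: {drop:.1f} points")
```

Grouping by turn index rather than by conversation keeps the sketch short. A fuller evaluation would likely also break results down by the number of active constraints and by whether a constraint was added or replaced.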
Abstract
In a conversation, a helpful assistant must reliably follow user directives, even as they refine, modify, or contradict earlier requests. Yet most instruction-following benchmarks focus on single-turn or short multi-turn scenarios, leaving open how well models handle long-horizon instruction-following tasks. To bridge this gap, we present SEQUOR, an automatic benchmark for evaluating constraint adherence in long multi-turn conversations. SEQUOR consists of simulated persona-driven interactions built with constraints extracted from real-world conversations. Our results show that even when following a single constraint, instruction-following accuracy consistently decreases as the conversation grows longer, with drops exceeding 11%. This decline becomes larger when models have to follow multiple constraints simultaneously, reducing their accuracy by over 40%. In scenarios where constraints…
