The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models

Xinyi Chen; Baohao Liao; Jirui Qi; Panagiotis Eustratiadis; Christof Monz; Arianna Bisazza; Maarten de Rijke

arXiv:2406.19999·cs.CL·December 12, 2025

TL;DR

The SIFo Benchmark assesses large language models' ability to follow multiple instructions sequentially, revealing current models' limitations and the need for improved robustness in instruction following tasks.

Contribution

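We introduce the SIFo Benchmark, a new evaluation framework for sequential instruction following in LLMs. It addresses three challenges in evaluating multi-instruction following: limited coherence between instructions, positional bias, and a lack of objectively verifiable tasks.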

Findings

01 Larger, more recent models outperform smaller ones on SIFo tasks.

02 All evaluated models struggle substantially with sequential instruction following.

03 The benchmark reveals robustness issues in current LLMs.

Abstract

Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rules), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source,…
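The verification idea in the abstract, that checking only the final instruction's result is enough, can be illustrated with a toy text-modification chain. This is a minimal sketch under stated assumptions, not the benchmark's actual data or scoring code: the helper names (`replace_word`, `append_text`, `apply_sequentially`, `final_answer_correct`), the example sentence, and the exact-match check are all hypothetical.

```python
from typing import Callable, List

# Hypothetical toy instructions; they are NOT taken from the SIFo dataset.
def replace_word(old: str, new: str) -> Callable[[str], str]:
    """Build an instruction that replaces every occurrence of `old` with `new`."""
    return lambda text: text.replace(old, new)

def append_text(suffix: str) -> Callable[[str], str]:
    """Build an instruction that appends `suffix` to the text."""
    return lambda text: text + suffix

def apply_sequentially(text: str, steps: List[Callable[[str], str]]) -> str:
    """Apply each instruction to the output of the previous one."""
    for step in steps:
        text = step(text)
    return text

def final_answer_correct(model_output: str, reference: str) -> bool:
    """Assumed scoring rule: exact match on the final state only."""
    return model_output.strip() == reference.strip()

context = "the cat sat on the mat"
steps = [
    replace_word("cat", "dog"),  # step 1
    replace_word("dog", "fox"),  # step 2 only has an effect if step 1 was followed
    append_text(" quietly"),     # step 3
]

# Reference answer: the state after all instructions are followed in order.
reference = apply_sequentially(context, steps)

# A simulated response that skipped step 1: the error propagates to the end.
skipped_first = apply_sequentially(context, steps[1:])

print(final_answer_correct(reference, reference))      # all steps followed
print(final_answer_correct(skipped_first, reference))  # an early step was missed
```

Because each step operates on the previous step's output, a mistake anywhere in the chain changes the final state, so a single check at the end covers the whole sequence.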

Code & Models

Repositories

shin-ee-chen/SIFo
PyTorch · Official

Videos

The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models · Underline

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling