AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Yunjia Qi; Hao Peng; Xiaozhi Wang; Amy Xin; Youfeng Liu; Bin Xu; Lei Hou; Juanzi Li

arXiv:2505.16944·cs.AI·May 23, 2025

TL;DR

This paper introduces AgentIF, a comprehensive benchmark designed to evaluate large language models' ability to follow complex, lengthy instructions in realistic agentic scenarios, revealing current models' limitations.

Contribution

The paper presents AgentIF, the first benchmark for systematically evaluating LLM instruction following in agentic scenarios. It comprises instructions drawn from 50 real-world agentic applications, together with detailed evaluation metrics.

Findings

01. Current LLMs perform poorly on complex constraints.

02. Models struggle with tool specifications and lengthy instructions.

03. Error analysis reveals specific failure modes.

Abstract

Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse…
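The three statistics quoted in the abstract (average words per instruction, maximum length, and average constraints per instruction) can be illustrated with a minimal sketch. The `Instruction` record, its field names, and the toy data below are hypothetical and are not the schema or contents of the released dataset; word counts here assume simple whitespace tokenization, which may differ from how the authors counted.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Tuple


@dataclass
class Instruction:
    """Hypothetical record; field names are illustrative, not the dataset's schema."""
    text: str
    constraints: List[str]


def length_stats(instructions: List[Instruction]) -> Tuple[float, int]:
    """Return (mean word count, max word count), splitting on whitespace."""
    counts = [len(item.text.split()) for item in instructions]
    return mean(counts), max(counts)


def mean_constraints(instructions: List[Instruction]) -> float:
    """Return the average number of constraints attached to an instruction."""
    return mean([len(item.constraints) for item in instructions])


# Toy data, invented for illustration only.
toy = [
    Instruction("Always answer in JSON and call the search tool first",
                ["format", "tool"]),
    Instruction("Reply in under fifty words", ["length"]),
    Instruction("Use the calculator tool only when the user asks for arithmetic",
                ["condition", "tool", "format"]),
]

avg_words, max_words = length_stats(toy)
avg_constraints = mean_constraints(toy)
print(f"avg words: {avg_words:.2f}, max words: {max_words}, "
      f"avg constraints: {avg_constraints}")
```

Applied to the real benchmark, the same aggregation would be expected to land near the figures reported in the abstract, provided the counting conventions match.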

Code & Models

Repositories

thu-keg/agentif (official)

Datasets

THU-KEG/AgentIF (dataset, 177 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques