AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios

Kaiyuan Chen; Qimin Wu; Taiyu Hou; Tianhao Tang; Xueyu Hu; Yuchen Hou; Bikun Li; Chengming Qian; Guoyin Wang; Haolin Chen; Haotong Tian; Haoye Zhang; Haoyu Bian; Hongbing Pan; Hongkang Zhang; Hongyi Zhou; Jiaqi Cai; Jiewu Rao; Jiyuan Ren; Keduan Huang; Lucia Zhu Huang; Mingyu Yuan; Naixu Guo; Qicheng Tang; Qinyan Zhang; Shuai Chen; Siheng Chen; Ting Ting Li; Xiaoxing Guo; Yaocheng Zuo; Yaoqi Guo; Yinan Wang; Yinzhou Yu; Yize Wang; Yuan Jiang; Yuan Tian; Yuanshuo Zhang; Yuxuan Liu; Yvette Yan Zeng; Zenyu Shan; Zihan Yin; Xiaobo Hu; Yang Liu; Yixin Ren; Yuan Gong

arXiv:2601.20613 · cs.CL · February 2, 2026

TL;DR

AgentIF-OneDay introduces a comprehensive benchmark to evaluate general AI agents on diverse daily tasks, emphasizing real-world usability and multi-faceted instructions, with promising results for current leading models.

Contribution

This paper presents a new benchmark, AgentIF-OneDay, for assessing AI agents on realistic daily tasks, including workflow execution, implicit instruction inference, and iterative refinement.

Findings

01. Leading AI agents perform well on the benchmark.
02. Open-source models have internalized agentic capabilities.
03. API-based and ChatGPT agents remain top-tier.

Abstract

The capacity of AI agents to effectively handle tasks of increasing duration and complexity continues to grow, demonstrating exceptional performance in coding, deep research, and complex problem-solving evaluations. However, in daily scenarios, the perception of these advanced AI capabilities among general users remains limited. We argue that current evaluations prioritize increasing task difficulty without sufficiently addressing the diversity of agentic tasks necessary to cover the daily work, life, and learning activities of a broad demographic. To address this, we propose AgentIF-OneDay, aimed at determining whether general users can utilize natural language instructions and AI agents to complete a diverse array of daily tasks. These tasks require not only solving problems through dialogue but also understanding various attachment types and delivering tangible file-based results.…

Code & Models

Datasets

xbench/AgentIF-OneDay
dataset · 587 dl

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Explainable Artificial Intelligence (XAI) · Topic Modeling