OctoBench: Benchmarking Scaffold-Aware Instruction Following in Repository-Grounded Agentic Coding

Deming Ding; Shichun Liu; Enhui Yang; Jiahang Lin; Ziying Chen; Shihan Dou; Honglin Guo; Weiyu Cheng; Pengyu Zhao; Chengjun Xiao; Qunhong Zeng; Qi Zhang; Xuanjing Huang; Qidi Xu; Tao Gui

arXiv:2601.10343·cs.CL·January 19, 2026

TL;DR

OctoBench is a comprehensive benchmark designed to evaluate how well large language models follow heterogeneous, scaffold-specified instructions in repository-grounded coding tasks, highlighting gaps in current model compliance.

Contribution

This paper introduces OctoBench, a benchmark of 34 environments and 217 tasks instantiated under three scaffold types, together with an automated observation-and-scoring toolkit for assessing scaffold-aware instruction following in coding agents.

Findings

01 Models show a systematic gap between solving the task and complying with scaffold-specified rules.

02 The benchmark covers 34 environments and 217 tasks, paired with 7,098 objective checklist items.

03 The authors release an automated observation-and-scoring toolkit for reproducible evaluation and model improvement.

Abstract

Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous…
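
The abstract's distinction between solving the task and following the rules can be illustrated with a minimal sketch. It assumes each checklist item reduces to a binary pass/fail verdict on a recorded trajectory and that task success is tracked separately. All names here (`ChecklistItem`, `TaskResult`, `compliance_rate`, `summarize`) and the aggregation formulas are hypothetical illustrations, not the OctoBench toolkit's API or its official metrics.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ChecklistItem:
    """One objective check against a trajectory (hypothetical structure)."""
    description: str
    passed: bool


@dataclass
class TaskResult:
    """Outcome of one task: solved or not, plus its checklist verdicts."""
    task_id: str
    solved: bool
    items: List[ChecklistItem] = field(default_factory=list)


def compliance_rate(result: TaskResult) -> float:
    """Fraction of checklist items passed; 0.0 if the task has no items."""
    if not result.items:
        return 0.0
    return sum(item.passed for item in result.items) / len(result.items)


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Report task solving and rule following as separate numbers."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        # Share of tasks whose end goal was achieved.
        "solve_rate": sum(r.solved for r in results) / n,
        # Average per-task fraction of checklist items passed.
        "mean_compliance": sum(compliance_rate(r) for r in results) / n,
        # Share of tasks where every checklist item passed.
        "fully_compliant": sum(
            bool(r.items) and all(i.passed for i in r.items) for r in results
        ) / n,
    }


if __name__ == "__main__":
    # Illustrative, made-up verdicts: both tasks are solved, one breaks a rule.
    demo = [
        TaskResult("task-a", solved=True, items=[
            ChecklistItem("follows the repository's naming rule", True),
            ChecklistItem("avoids a forbidden command", False),
        ]),
        TaskResult("task-b", solved=True, items=[
            ChecklistItem("follows the repository's naming rule", True),
            ChecklistItem("avoids a forbidden command", True),
        ]),
    ]
    print(summarize(demo))
```

Keeping `solve_rate` apart from the two compliance numbers is the point of the sketch: an agent can solve every task while still violating some of the constraints it was given.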

Code & Models

Datasets

MiniMaxAI/OctoBench (dataset · 119 dl)

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Machine Learning and Algorithms