LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces

Yukang Feng; Jianwen Sun; Zelai Yang; Jiaxin Ai; Chuanhao Li; Zizhen Li; Fanrui Zhang; Kang He; Rui Ma; Jifan Lin; Jie Sun; Yang Xiao; Sizhuo Zhou; Wenxiao Wu; Yiming Liu; Pengfei Liu; Yu Qiao; Shenglin Zhang; Kaipeng Zhang

arXiv:2602.14337 · cs.SE · February 27, 2026

TL;DR

LongCLI-Bench is a new benchmark for evaluating AI agents on long-horizon, realistic command-line tasks, revealing current agents' limitations and emphasizing the need for human collaboration and improved planning.

Contribution

The paper introduces LongCLI-Bench, a comprehensive benchmark with a dual-set testing protocol and step-level scoring for assessing long-horizon agentic programming in CLI tasks.

Findings

01. State-of-the-art agents achieve below 20% pass rates.

02. Most tasks stall before 30% completion.

03. Human collaboration significantly improves performance.

Abstract

Recent advances in AI-assisted programming have empowered agents to execute complex workflows via command-line interfaces. However, existing benchmarks are limited by short task horizons, data contamination from GitHub scraping, and a lack of fine-grained evaluation metrics, and they fail to rigorously evaluate the long-horizon planning and execution capabilities essential for realistic software engineering. To address these gaps, we introduce LongCLI-Bench, a comprehensive benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks. We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world workflows, covering four engineering categories: from scratch, feature addition, bug fixing, and refactoring. We propose a dual-set testing protocol for LongCLI-Bench, which measures requirement fulfillment (fail-to-pass) and…
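
To make the fail-to-pass idea concrete, here is a minimal sketch in Python. It assumes the common reading of the term: a fail-to-pass test fails before the agent runs and is expected to pass afterwards. The function name, the test names, and the dictionary layout are hypothetical and are not taken from the paper. The sketch covers only the requirement-fulfillment side, since the abstract above is cut off before the second test set is named.

```python
from typing import Dict


def fail_to_pass_rate(before: Dict[str, bool], after: Dict[str, bool]) -> float:
    """Fraction of initially failing tests that pass after the agent's run.

    `before` and `after` map a test name to True (passed) or False (failed).
    This layout is an illustrative assumption, not the benchmark's format.
    """
    # Target tests are those that failed before the agent touched the repo.
    targets = [name for name, passed in before.items() if not passed]
    if not targets:
        return 0.0
    # A test missing from `after` is treated as still failing.
    fixed = sum(1 for name in targets if after.get(name, False))
    return fixed / len(targets)


# Hypothetical results for one task: two tests fail initially, one gets fixed.
before = {"test_parse": False, "test_cli_flags": False, "test_help": True}
after = {"test_parse": True, "test_cli_flags": False, "test_help": True}
rate = fail_to_pass_rate(before, after)
print(rate)
```

Reporting a fraction rather than a single pass/fail bit is one way to expose partial progress on a long task. Whether LongCLI-Bench aggregates its results this way is not stated in the text above.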

Taxonomy

Topics: Software System Performance and Reliability · Advanced Software Engineering Methodologies · Scientific Computing and Data Management