AgentFloor: How Far Up the Tool Use Ladder Can Small Open-Weight Models Go?
Ranit Karmakar, Jayita Chatterjee

TL;DR
AgentFloor is a 30-task benchmark that evaluates how well small open-weight models handle routine agent tasks. It finds that smaller models are sufficient for many short-horizon functions, so larger models can be reserved for complex planning.
Contribution
This paper presents AgentFloor, a new benchmark, and uses it to evaluate 16 open-weight models, showing that they handle routine agent tasks effectively and identifying the boundary beyond which larger models are necessary.
Findings
Small and mid-sized open-weight models match GPT-5 on short-horizon tasks.
Larger models outperform smaller ones on long-horizon planning and sustained coordination.
The boundary of model necessity is determined not by scale alone but also by task-specific factors.
Abstract
Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate,…
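The routing question raised in the abstract can be illustrated with a minimal sketch. It assumes a workflow where each model call is already labeled with one of the four capability areas the abstract names. The `Capability` enum, the `route` function, the "small"/"large" labels, and the cutoff at tool use are all hypothetical and chosen for illustration only; they are not the benchmark's six tiers or the authors' method, and the findings above note that the real boundary also depends on task-specific factors.

```python
from enum import IntEnum


class Capability(IntEnum):
    """Capability areas named in the abstract, in the order listed there.

    This is NOT the benchmark's six-tier ladder, which the abstract
    does not enumerate; it is an illustrative stand-in.
    """
    INSTRUCTION_FOLLOWING = 1
    TOOL_USE = 2
    MULTI_STEP_COORDINATION = 3
    LONG_HORIZON_PLANNING = 4


# Hypothetical cutoff: calls at or below this level go to a small model.
SMALL_MODEL_CEILING = Capability.TOOL_USE


def route(capability: Capability,
          ceiling: Capability = SMALL_MODEL_CEILING) -> str:
    """Return "small" or "large" depending on where the call sits on the ladder."""
    return "small" if capability <= ceiling else "large"


# One user request often fans out into several model calls.
calls = [
    Capability.INSTRUCTION_FOLLOWING,
    Capability.TOOL_USE,
    Capability.LONG_HORIZON_PLANNING,
]
plan = [route(c) for c in calls]
print(plan)
```

In practice the ceiling would be set from measured per-task results rather than fixed by hand, since scale alone does not decide where a small model stops being sufficient.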
