AgentFloor: How Far Up the Tool-Use Ladder Can Small Open-Weight Models Go?

Ranit Karmakar; Jayita Chatterjee

arXiv:2605.00334·cs.AI·May 4, 2026

TL;DR

AgentFloor is a 30-task benchmark that measures how well small open-weight models handle routine agent tasks. It finds that smaller models are sufficient for much of the short-horizon, structured tool-use work, while larger models remain necessary for long-horizon planning.

Contribution

The paper presents AgentFloor, a new benchmark, and uses it to evaluate 16 open-weight models, showing that they are effective on routine agent tasks and identifying the boundary beyond which larger models become necessary.

Findings

01

Small and mid-sized open-weight models match GPT-5 on short-horizon tasks.

02

Larger models outperform smaller ones on long-horizon planning and sustained coordination.

03

The boundary of model necessity is determined not by scale alone but also by task-specific factors.

Abstract

Production agentic systems make many model calls per user request, and most of those calls are short, structured, and routine. This raises a practical routing question that existing evaluations do not directly answer: which parts of an agent workflow truly require large frontier intelligence, and which can be handled by smaller models? We introduce AgentFloor, a deterministic 30-task benchmark organized as a six-tier capability ladder, spanning instruction following, tool use, multi-step coordination, and long-horizon planning under persistent constraints. We evaluate 16 open-weight models, from 0.27B to 32B parameters, alongside GPT-5 across 16,542 scored runs. Our results reveal a clear boundary of model necessity. Small and mid-sized open-weight models are already sufficient for much of the short-horizon, structured tool use work that dominates real agent pipelines, and in aggregate,…
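The routing question raised in the abstract can be illustrated with a minimal sketch, assuming each model call in a workflow has already been labeled with a tier on the six-tier ladder. The `AgentCall` structure, the `route` function, the example call names, and the cutoff `SMALL_MODEL_MAX_TIER = 3` are all hypothetical and chosen only for illustration; the paper's actual boundary is not reproduced here.

```python
from dataclasses import dataclass

# The abstract describes a six-tier capability ladder.
NUM_TIERS = 6
# Assumption: an illustrative cutoff, NOT a result reported by the paper.
SMALL_MODEL_MAX_TIER = 3


@dataclass(frozen=True)
class AgentCall:
    """One model call inside an agent workflow (hypothetical structure)."""
    name: str
    tier: int  # 1 = lowest rung of the ladder, NUM_TIERS = highest


def route(call: AgentCall, small_max_tier: int = SMALL_MODEL_MAX_TIER) -> str:
    """Return which model class ("small" or "large") should serve the call."""
    if not 1 <= call.tier <= NUM_TIERS:
        raise ValueError(f"tier must be in 1..{NUM_TIERS}, got {call.tier}")
    return "small" if call.tier <= small_max_tier else "large"


# Hypothetical calls and tier labels, for illustration only.
calls = [
    AgentCall("format_json_reply", tier=1),
    AgentCall("single_tool_lookup", tier=2),
    AgentCall("plan_with_persistent_constraints", tier=6),
]
assignments = {c.name: route(c) for c in calls}
```

In practice the cutoff would be set per task family from measured pass rates rather than fixed globally, since the findings note that the boundary depends on task-specific factors as well as scale.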
