Three Roles, One Model: Role Orchestration at Inference Time to Close the Performance Gap Between Small and Large Agents

S. Aaron McClendon, Jorge Gallego-Feliciano, Stavros Zervoudakis, and Antonios Saravanos

arXiv:2604.11465·cs.AI·April 16, 2026

TL;DR

This paper demonstrates that inference-time role-based scaffolding significantly improves small language model performance on complex tasks, making them competitive with larger models without additional training.

Contribution

The paper introduces a three-role inference scaffolding pipeline that improves a small LLM's performance on multi-step tasks without extra training or fine-tuning.

Findings

01 Scaffolding roughly doubles task goal completion rates.

02 With scaffolding, small models outperform larger models on the AppWorld benchmark.

03 Structured inference interventions can rival systems four times larger.

Abstract

Large language model (LLM) agents show promise on realistic tool-use tasks, but deploying capable agents on modest hardware remains challenging. We study whether inference-time scaffolding alone, without any additional training compute, can improve the performance of a small model in complex multi-step environments. Operating on a single 24GB GPU, we evaluate Qwen3-8B on the AppWorld benchmark under both full-precision and 4-bit quantized configurations. Without any intervention, the raw model achieves just 5.4% (FP16) and 3.0% (AWQ) task goal completion. Guided by a systematic failure mode analysis, we introduce a three-tier inference scaffolding pipeline that deploys the same frozen model in three distinct roles: (1) a summarization model that preserves critical artifacts (tokens, credentials, API responses) while compressing dialogue history; (2) the main agent model that reasons…
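
The abstract describes one frozen model deployed in several roles. The following is a minimal sketch of the first two roles only, since the description of the remaining role is cut off in this excerpt. Everything named here is an assumption made for illustration: the `Model` callable type, the prompt strings, the `keep_last` parameter, and the `[summary]` tag are hypothetical and are not taken from the paper.

```python
from typing import Callable, List

# Hypothetical interface: any function mapping (system_prompt, user_text) to a completion.
# The same callable, i.e. the same frozen model, is reused for every role.
Model = Callable[[str, str], str]

# Illustrative prompts only; the paper's actual prompts are not given in this excerpt.
SUMMARIZER_PROMPT = (
    "Compress the dialogue history. Copy tokens, credentials, and API "
    "responses verbatim; shorten everything else."
)
AGENT_PROMPT = "You are the main agent. Reason about the task and emit the next action."


def summarize_history(model: Model, history: List[str], keep_last: int = 2) -> List[str]:
    """Role 1 (assumed shape): compress older turns and keep recent turns untouched."""
    if len(history) <= keep_last:
        return list(history)
    split = len(history) - keep_last
    older, recent = history[:split], history[split:]
    summary = model(SUMMARIZER_PROMPT, "\n".join(older))
    return [f"[summary] {summary}"] + recent


def agent_step(model: Model, task: str, history: List[str]) -> str:
    """Role 2 (assumed shape): the same model, prompted as the main agent."""
    context = "\n".join(summarize_history(model, history))
    return model(AGENT_PROMPT, f"Task: {task}\nHistory:\n{context}")
```

The only thing that changes between roles in this sketch is the system prompt, which is one plausible reading of "the same frozen model in three distinct roles"; the paper may orchestrate the roles differently.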
