Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks?
Spandan Garg, Vikram Nitin, Yufan Huang

TL;DR
Terminus-4B is a finetuned small language model that can replace frontier models as the subagent for agentic terminal execution, reducing token usage while maintaining performance.
Contribution
This paper introduces Terminus-4B, a finetuned small model that matches or exceeds frontier models in agentic terminal execution tasks.
Findings
Terminus-4B reduces token usage by up to 30% compared to the baseline.
It maintains performance on SWE-Bench Pro and on internal benchmarks.
On agentic terminal execution, Terminus-4B often surpasses frontier models such as Claude Sonnet and GPT-5.3-Codex.
Abstract
Modern coding agents increasingly delegate specialized subtasks to subagents: smaller, focused agentic loops that handle narrow responsibilities such as search, debugging, or terminal execution. This architectural pattern keeps the main agent's context window clean by isolating verbose outputs (e.g. build logs and test results) within the subagent's context. Agents that employ subagents for such tasks typically run them on frontier models. In this paper, we investigate whether a finetuned small language model (SLM) can match frontier models at agentic terminal execution. We present Terminus-4B, a Qwen3-4B model post-trained specifically for this task via Supervised Finetuning (SFT) and Reinforcement Learning (RL) with a rubric-based LLM-as-judge reward. In our extensive evaluation spanning various frontier models,…
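The context-isolation pattern described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: `Context`, `summarize`, `terminal_subagent`, and `fake_run` are hypothetical names, the subagent's model loop is replaced by a fixed summarizer, the shell is replaced by a stub that emits a noisy log, and character count stands in for token usage.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    """A message log standing in for a model's context window (assumption)."""
    messages: List[str] = field(default_factory=list)

    def add(self, text: str) -> None:
        self.messages.append(text)

    def size(self) -> int:
        # Character count as a crude proxy for token usage.
        return sum(len(m) for m in self.messages)


def summarize(output: str, max_lines: int = 2) -> str:
    """Hypothetical summarizer: keep only the last few non-empty lines.

    In a real subagent, a language model would write this summary.
    """
    lines = [ln for ln in output.splitlines() if ln.strip()]
    return "\n".join(lines[-max_lines:])


def terminal_subagent(command: str, run: Callable[[str], str]) -> str:
    """Run a command in a private context and return only a short summary."""
    private = Context()
    private.add(f"$ {command}")
    private.add(run(command))  # the verbose output stays in the private context
    return summarize(private.messages[-1])


def fake_run(command: str) -> str:
    """Stand-in for a real shell: returns a long, noisy test log."""
    noise = "\n".join(f"test_case_{i} ... ok" for i in range(200))
    return noise + "\nRan 200 tests\nOK"


# The main agent's context receives the summary, never the full log.
main_context = Context()
main_context.add("User: run the test suite")
main_context.add(terminal_subagent("pytest -q", fake_run))

full_log_size = len(fake_run("pytest -q"))
```

Here the main context grows by two short lines rather than the 200-line log, which is the mechanism by which delegation can lower the main agent's token usage. How much is saved in practice depends on the subagent's summaries and on the workload.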
