Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment
Warren Johnson, Charles Lee

TL;DR
Small language models (SLMs) can perform task routing efficiently at inference time, offering a low-cost, low-latency alternative to larger models and classifiers, but still face accuracy and quality challenges for production use.
Contribution
This paper evaluates small language models for task routing, showing both their potential and their limitations through a harmonized offline benchmark and a synthetic-traffic experiment.
Findings
Qwen-2.5-3B achieves the best accuracy-latency tradeoff among the tested models.
DeepSeek-V3 has the highest accuracy but exceeds latency constraints.
No model currently meets the criteria for standalone production viability.
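The three findings above can be read as the result of a two-sided gate: a router is viable only if it is both accurate enough and fast enough. The sketch below illustrates that logic under stated assumptions. The names `RouterResult`, `meets_criteria`, and `best_viable`, the candidate labels, the thresholds, and every number are hypothetical placeholders chosen for illustration; none are taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RouterResult:
    """One candidate router's benchmark summary (all values hypothetical)."""
    name: str          # placeholder label, not a model from the paper
    accuracy: float    # fraction of benchmark cases classified correctly
    latency_ms: float  # per-request routing latency in milliseconds


def meets_criteria(result: RouterResult,
                   min_accuracy: float,
                   max_latency_ms: float) -> bool:
    """A candidate passes only if it clears BOTH the accuracy and latency bars."""
    return (result.accuracy >= min_accuracy
            and result.latency_ms <= max_latency_ms)


def best_viable(results: List[RouterResult],
                min_accuracy: float,
                max_latency_ms: float) -> Optional[RouterResult]:
    """Return the most accurate candidate that passes the gate, or None."""
    viable = [r for r in results
              if meets_criteria(r, min_accuracy, max_latency_ms)]
    if not viable:
        return None
    return max(viable, key=lambda r: r.accuracy)


# Placeholder candidates: one fast but inaccurate, one accurate but slow.
CANDIDATES = [
    RouterResult("fast-but-weak", accuracy=0.70, latency_ms=200.0),
    RouterResult("accurate-but-slow", accuracy=0.95, latency_ms=3000.0),
]

# Assumed thresholds; the paper's actual criteria are not reproduced here.
choice = best_viable(CANDIDATES, min_accuracy=0.85, max_latency_ms=1000.0)
```

With these placeholder values `choice` is `None`: the fast candidate fails the accuracy bar and the accurate candidate fails the latency bar, which is the same shape as the situation the findings describe.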
Abstract
Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, and that reduce a multi-objective optimization to a single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B…
