Evaluating Small Language Models for Front-Door Routing: A Harmonized Benchmark and Synthetic-Traffic Experiment

Warren Johnson; Charles Lee

arXiv:2604.02367 · cs.NI · April 6, 2026

TL;DR

Small language models (SLMs) can route tasks at inference time at lower cost and latency than LLM-based classifiers or preference-trained routers, but none of the tested models is yet accurate enough for standalone production use.

Contribution

This paper evaluates the effectiveness of small language models for task routing, demonstrating their potential and limitations through a harmonized benchmark and synthetic traffic experiments.

Findings

1. Qwen-2.5-3B achieves the best accuracy and latency tradeoff among tested models.

2. DeepSeek-V3 has the highest accuracy but exceeds latency constraints.

3. No model currently meets the criteria for standalone production viability.

Abstract

Selecting the appropriate model at inference time -- the routing problem -- requires jointly optimizing output quality, cost, latency, and governance constraints. Existing approaches delegate this decision to LLM-based classifiers or preference-trained routers that are themselves costly and high-latency, reducing a multi-objective optimization to single-dimensional quality prediction. We argue that small language models (SLMs, 1-4B parameters) have now achieved sufficient reasoning capability for sub-second, zero-marginal-cost, self-hosted task classification, potentially making the routing decision negligible in the inference budget. We test this thesis on a six-label taxonomy through two studies. Study 1 is a harmonized offline benchmark of Phi-3.5-mini, Qwen2.5-1.5B, and Qwen-2.5-3B on identical Azure T4 hardware, serving stack, quantization, and a fixed 60-case corpus. Qwen-2.5-3B…
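To make the front-door routing idea concrete, the following is a minimal sketch of how a classifier's output might be mapped to a backend while recording the router's own latency. It is not the authors' implementation. The label names in `LABELS`, the backend names in `BACKENDS`, and the `keyword_stub` classifier are all hypothetical: the abstract states that the taxonomy has six labels but does not list them, and in practice `classify` would wrap a call to a self-hosted SLM.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical six-label taxonomy; the abstract does not name the real labels.
LABELS = ("chat", "code", "math", "summarize", "extract", "other")

# Hypothetical mapping from task label to a downstream model tier.
BACKENDS = {
    "chat": "small-general",
    "code": "code-specialist",
    "math": "large-reasoning",
    "summarize": "small-general",
    "extract": "small-general",
    "other": "large-general",
}


@dataclass
class RoutingDecision:
    label: str
    backend: str
    router_latency_ms: float


def parse_label(raw: str, fallback: str = "other") -> str:
    """Normalize free-text classifier output to a known label."""
    cleaned = raw.strip().lower()
    for label in LABELS:
        if cleaned.startswith(label):
            return label
    # Unrecognized output falls back to a safe default label.
    return fallback


def route(prompt: str, classify: Callable[[str], str]) -> RoutingDecision:
    """Classify the prompt, time the classifier call, and pick a backend."""
    start = time.perf_counter()
    raw = classify(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    label = parse_label(raw)
    return RoutingDecision(label, BACKENDS[label], elapsed_ms)


def keyword_stub(prompt: str) -> str:
    """Stand-in for an SLM call; a real router would query the model here."""
    text = prompt.lower()
    if "def " in text or "bug" in text:
        return "code"
    if any(ch.isdigit() for ch in text):
        return "math"
    return "Other."


if __name__ == "__main__":
    decision = route("Fix the bug in my parser", keyword_stub)
    print(decision.label, decision.backend)
```

Recording `router_latency_ms` alongside the label lets the routing overhead be compared against whatever latency budget a deployment sets, which is the tradeoff the findings above turn on.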
