Guarded Query Routing for Large Language Models

Richard Šléher; William Brach; Tibor Sloboda; Kristián Košťál; Lukas Galke

arXiv:2505.14524·cs.AI·January 29, 2026

TL;DR

This paper introduces GQR-Bench, a benchmark for guarded query routing with LLMs. It compares LLM-based routers, guardrail approaches, and lightweight classifiers, and finds that lightweight classifiers achieve a good accuracy-latency trade-off, challenging the reliance on LLMs for this task.

Contribution

The paper presents GQR-Bench, a comprehensive benchmark for guarded query routing, and provides an extensive comparison of LLM-based and traditional models, highlighting practical trade-offs.

Findings

01

WideMLP achieves 88% accuracy with latency under 4 ms.

02

fastText offers 80% accuracy with latency under 1 ms.

03

LLMs reach 91% accuracy but are slower, with latencies of up to 669 ms.

Abstract

Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered a text classification problem. However, out-of-distribution queries must be handled properly, as these may concern unrelated domains, be written in other languages, or even contain unsafe text. Here, we thus study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench, released as the Python package gqr), which covers three exemplary target domains (law, finance, and healthcare) and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP,…
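To make the problem setting concrete, the following is a minimal sketch of guarded routing framed as text classification with a rejection option. It is not one of the methods evaluated in the paper and does not use the gqr package: the training queries, the stop-word list, the `route` function, the `out_of_distribution` label, and the 0.2 threshold are all assumptions made up for illustration. Each domain is represented by a bag-of-words centroid, and a query is rejected when its best cosine similarity falls below the threshold.

```python
import math
from collections import Counter

# Toy in-domain queries, invented for illustration (not taken from GQR-Bench).
TRAIN = {
    "law": [
        "can my landlord terminate the lease contract",
        "what is the statute of limitations for a lawsuit",
    ],
    "finance": [
        "how do interest rates affect bond prices",
        "should i invest in an index fund or stocks",
    ],
    "healthcare": [
        "what are symptoms of high blood pressure",
        "how is diabetes treated with insulin",
    ],
}

# Small hand-picked stop-word list (an assumption) so that function words
# do not create spurious similarity between unrelated queries.
STOPWORDS = {
    "what", "is", "the", "of", "for", "a", "in", "an", "how", "do",
    "i", "can", "my", "are", "with", "or", "should",
}


def bow(text):
    """Lower-cased bag-of-words counts with stop words removed."""
    return Counter(t for t in text.lower().split() if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# One centroid per target domain: the summed counts of its training queries.
CENTROIDS = {
    domain: sum((bow(q) for q in queries), Counter())
    for domain, queries in TRAIN.items()
}


def route(query, threshold=0.2):
    """Return the best-matching domain, or reject the query as out-of-distribution."""
    vec = bow(query)
    scores = {domain: cosine(vec, c) for domain, c in CENTROIDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else "out_of_distribution"


print(route("what is the statute of limitations for a contract lawsuit"))
print(route("wie wird das wetter morgen in berlin"))
```

In this sketch the first query shares several content words with the law centroid and is routed there, while the German query shares no vocabulary with any domain and is rejected. A real guarded router would need a learned classifier and a threshold calibrated on held-out data; the fixed threshold here only illustrates where the guard sits in the pipeline.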

Code & Models

Repositories

williambrach/gqr
none · Official

Taxonomy

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · fastText