TL;DR
This paper introduces GQR-Bench, a benchmark for guarded query routing across LLM endpoints. It compares LLM-based routers, LLM-based guardrail approaches, and lightweight classifiers, and finds that lightweight classifiers offer a good accuracy-speed trade-off, which challenges the reliance on large LLMs for routing.
Contribution
The paper presents GQR-Bench, a comprehensive benchmark for guarded query routing, and provides an extensive comparison of LLM-based and traditional models, highlighting practical trade-offs.
Findings
WideMLP achieves 88% accuracy with latency under 4 ms.
FastText offers 80% accuracy with latency under 1 ms.
LLMs reach 91% accuracy but are slower, with latency of up to 669 ms.
Abstract
Query routing, the task of routing user queries to different large language model (LLM) endpoints, can be considered a text classification problem. However, out-of-distribution queries must be handled properly, as they could concern unrelated domains, be written in other languages, or even contain unsafe text. Here, we thus study a guarded query routing problem, for which we first introduce the Guarded Query Routing Benchmark (GQR-Bench, released as the Python package gqr), which covers three exemplary target domains (law, finance, and healthcare) and seven datasets to test robustness against out-of-distribution queries. We then use GQR-Bench to contrast the effectiveness and efficiency of LLM-based routing mechanisms (GPT-4o-mini, Llama-3.2-3B, and Llama-3.1-8B), standard LLM-based guardrail approaches (LlamaGuard and NVIDIA NeMo Guardrails), continuous bag-of-words classifiers (WideMLP,…
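To make the idea of guarded routing concrete, here is a minimal sketch of a router that treats routing as text classification and rejects queries that match no target domain. It is not the paper's method and does not use the gqr package: the toy training phrases, the bag-of-words cosine scoring, the `threshold` value, and the `out_of_distribution` label are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical toy training data for the three target domains.
TRAIN = {
    "law": ["contract breach lawsuit court", "tenant eviction legal rights court"],
    "finance": ["stock portfolio interest rate", "loan mortgage interest bank"],
    "healthcare": ["symptoms fever doctor treatment", "blood pressure medication doctor"],
}


def tokenize(text):
    """Lowercase whitespace tokenization (an assumed, deliberately simple choice)."""
    return text.lower().split()


def build_centroids(train):
    """Sum the bag-of-words counts of each domain's example phrases."""
    centroids = {}
    for label, docs in train.items():
        counts = Counter()
        for doc in docs:
            counts.update(tokenize(doc))
        centroids[label] = counts
    return centroids


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query, centroids, threshold=0.2):
    """Return the best-matching domain, or reject the query if no score clears the threshold."""
    q = Counter(tokenize(query))
    scores = {label: cosine(q, c) for label, c in centroids.items()}
    best = max(scores, key=scores.get)
    if scores[best] < threshold:
        return "out_of_distribution"  # the guard: do not forward to any endpoint
    return best


centroids = build_centroids(TRAIN)
print(route("my landlord started an eviction in court", centroids))
print(route("what is the best pizza topping", centroids))
```

The guard here is only a similarity threshold; a real system would need to tune it on held-out in-domain and out-of-distribution queries, which is the kind of robustness the benchmark is designed to measure.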
Taxonomy
Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · fastText
