PickLLM: Context-Aware RL-Assisted Large Language Model Routing

Dimitrios Sikeridis; Dennis Ramdass; Pranay Pareek

arXiv:2412.12170·cs.LG·December 18, 2024

Open Access

TL;DR

PickLLM is a reinforcement learning-based framework that dynamically routes queries to the most suitable large language model, optimizing for cost, latency, and accuracy in real-time.

Contribution

It introduces a novel RL-based approach for LLM routing that considers multiple customizable objectives and converges efficiently to optimal model selection.
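To make the idea of RL-assisted routing with customizable objectives concrete, here is a minimal sketch. It is not the paper's algorithm: it assumes a stateless, epsilon-greedy Q-value update over a fixed model pool, and a reward that trades a response-quality score against cost and latency. The model names, their profiles, the weights `w_score`, `w_cost`, `w_latency`, and the simulated scores are all hypothetical values chosen for illustration.

```python
import random


class EpsilonGreedyRouter:
    """Stateless Q-value router over a pool of candidate models (illustrative)."""

    def __init__(self, models, epsilon=0.1, alpha=0.2, seed=0):
        self.models = list(models)
        self.epsilon = epsilon          # probability of exploring a random model
        self.alpha = alpha              # learning rate for the running estimate
        self.q = {m: 0.0 for m in self.models}
        self.rng = random.Random(seed)

    def select(self):
        # Explore with probability epsilon, otherwise pick the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.q[m])

    def update(self, model, reward_value):
        # Move the estimate for the chosen model toward the observed reward.
        self.q[model] += self.alpha * (reward_value - self.q[model])


def reward(score, cost, latency, w_score=1.0, w_cost=0.5, w_latency=0.1):
    """Hypothetical weighted reward: quality minus cost and latency penalties."""
    return w_score * score - w_cost * cost - w_latency * latency


def simulate(steps=2000, seed=0):
    # Made-up profiles: (mean quality score, cost per query, latency in seconds).
    profiles = {
        "small-local": (0.60, 0.00, 0.5),
        "mid-api": (0.80, 0.10, 1.0),
        "large-api": (0.85, 0.60, 2.0),
    }
    router = EpsilonGreedyRouter(profiles, seed=seed)
    noise = random.Random(seed + 1)
    for _ in range(steps):
        model = router.select()
        mean_score, cost, latency = profiles[model]
        # Simulated noisy quality score, clipped to [0, 1].
        score = min(1.0, max(0.0, noise.gauss(mean_score, 0.05)))
        router.update(model, reward(score, cost, latency))
    return router


if __name__ == "__main__":
    router = simulate()
    for name, value in router.q.items():
        print(f"{name}: {value:.3f}")
```

Under these made-up profiles and weights, the expected rewards are 0.55, 0.65, and 0.35, so the router should come to prefer the middle option rather than the most accurate or the cheapest one. Changing the weights changes which model wins, which is the sense in which the objective is customizable.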

Findings

01. Reduces query cost and latency effectively.

02. Converges quickly to optimal LLM choices.

03. Improves response quality based on customizable scoring.

Abstract

Recently, the number of off-the-shelf Large Language Models (LLMs) has exploded, with many open-source options. This creates a diverse landscape in terms of both serving options (e.g., inference on local hardware vs. remote LLM APIs) and heterogeneous model expertise. However, it is hard for the user to optimize efficiently for operational cost (pricing structures, expensive LLMs-as-a-service for large querying volumes), efficiency, or even case-specific measures such as response accuracy, bias, or toxicity. Also, existing LLM routing solutions focus mainly on cost reduction, with response accuracy optimizations relying on non-generalizable supervised training, and ensemble approaches necessitating output computation for every considered LLM candidate. In this work, we tackle the challenge of selecting the optimal LLM from a model pool for specific queries with customizable…

Taxonomy

Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Sparse Evolutionary Training · Q-Learning · Focus