Efficient Multi-Model Orchestration for Self-Hosted Large Language Models

Bhanu Prakash Vangala; Tanu Malik

arXiv:2512.22402 · cs.DC · December 30, 2025

TL;DR

This paper presents Pick and Spin, a scalable, cost-effective framework for orchestrating self-hosted large language models using Kubernetes, adaptive routing, and workload automation to improve performance and efficiency.

Contribution

It introduces a practical, hybrid routing and automation system for self-hosted LLMs, enhancing scalability, cost-efficiency, and reliability over traditional static deployment methods.

Findings

01. Up to 21.6% higher success rates

02. 30% lower latency

03. 33% lower GPU cost per query

Abstract

Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama-3 (90B), Gemma-3 (27B), Qwen-3 (235B), and DeepSeek-R1 (685B), across eight public benchmark datasets with five inference strategies and two routing variants, encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to…
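The hybrid routing idea described in the abstract (keyword heuristics first, a lightweight classifier as fallback, then a choice that trades off cost, latency, and accuracy) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the keyword table, the model names, the per-model numbers, the weights, and the linear scoring rule are all hypothetical, and the classifier is a plain stub standing in for a DistilBERT model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ModelProfile:
    """Hypothetical per-model profile; all values are illustrative only."""
    name: str
    cost: float                 # relative GPU cost per query (lower is better)
    latency: float              # relative latency (lower is better)
    accuracy: Dict[str, float]  # assumed accuracy per task category, in [0, 1]


# Hypothetical keyword table; the paper's actual heuristics are not shown here.
KEYWORDS = {
    "code": ("python", "function", "compile", "bug"),
    "math": ("prove", "integral", "equation", "solve"),
}

# Hypothetical model pool with made-up numbers.
MODELS = [
    ModelProfile("small-model", cost=0.1, latency=0.1,
                 accuracy={"general": 0.7, "code": 0.5, "math": 0.4}),
    ModelProfile("large-model", cost=0.9, latency=0.8,
                 accuracy={"general": 0.8, "code": 0.9, "math": 0.9}),
]


def keyword_category(prompt: str) -> Optional[str]:
    """Return the first category whose keyword occurs in the prompt, else None."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return None


def route(prompt: str,
          models: List[ModelProfile],
          classify: Callable[[str], str],
          w_accuracy: float = 1.0,
          w_cost: float = 0.1,
          w_latency: float = 0.1) -> ModelProfile:
    """Pick the model with the best weighted accuracy/cost/latency score.

    The keyword heuristic runs first; the classifier is consulted only
    when no keyword matches. The linear score is an assumed rule.
    """
    category = keyword_category(prompt) or classify(prompt)

    def score(model: ModelProfile) -> float:
        return (w_accuracy * model.accuracy.get(category, 0.0)
                - w_cost * model.cost
                - w_latency * model.latency)

    return max(models, key=score)


def stub_classifier(prompt: str) -> str:
    """Stand-in for a learned classifier; always answers 'general'."""
    return "general"


if __name__ == "__main__":
    for p in ("Fix the bug in this Python function",
              "What is the capital of France?"):
        print(p, "->", route(p, MODELS, stub_classifier).name)
```

With these made-up profiles, a coding prompt is sent to the larger model because its assumed accuracy gain outweighs its cost and latency penalty, while a general prompt stays on the smaller one. In a real deployment the stub would be replaced by a trained classifier and the profiles by measured values.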

Taxonomy

Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Topic Modeling