Efficient Multi-Model Orchestration for Self-Hosted Large Language Models
Bhanu Prakash Vangala, Tanu Malik

TL;DR
This paper presents Pick and Spin, a scalable, cost-effective framework for orchestrating self-hosted large language models using Kubernetes, adaptive routing, and workload automation to improve performance and efficiency.
Contribution
It introduces a practical, hybrid routing and automation system for self-hosted LLMs, enhancing scalability, cost-efficiency, and reliability over traditional static deployment methods.
Findings
Up to 21.6% higher success rates
30% lower latency
33% lower GPU cost per query
Abstract
Self-hosting large language models (LLMs) is increasingly appealing for organizations seeking privacy, cost control, and customization. Yet deploying and maintaining in-house models poses challenges in GPU utilization, workload routing, and reliability. We introduce Pick and Spin, a practical framework that makes self-hosted LLM orchestration scalable and economical. Built on Kubernetes, it integrates a unified Helm-based deployment system, adaptive scale-to-zero automation, and a hybrid routing module that balances cost, latency, and accuracy using both keyword heuristics and a lightweight DistilBERT classifier. We evaluate four models, Llama-3 (90B), Gemma-3 (27B), Qwen-3 (235B), and DeepSeek-R1 (685B), across eight public benchmark datasets, five inference strategies, and two routing variants, encompassing 31,019 prompts and 163,720 inference runs. Pick and Spin achieves up to…
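The hybrid routing idea described in the abstract (keyword heuristics first, a lightweight classifier as fallback, then a trade-off between cost, latency, and accuracy) can be illustrated with a minimal sketch. This is not the authors' implementation: the keyword table, the `ModelProfile` fields, the weights, and the scoring formula are all hypothetical, and the classifier is passed in as a plain callable standing in for something like a fine-tuned DistilBERT model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ModelProfile:
    """Hypothetical per-model profile; all numbers are relative, made-up units."""
    name: str
    cost: float                 # assumed relative GPU cost per query
    latency: float              # assumed relative latency per query
    accuracy: Dict[str, float]  # assumed accuracy per prompt category, in [0, 1]


# Hypothetical keyword table; a real deployment would tune this per workload.
KEYWORDS = {
    "code": ("python", "function", "compile", "bug"),
    "math": ("prove", "integral", "equation", "solve"),
}


def keyword_category(prompt: str) -> Optional[str]:
    """Return the first category whose keywords appear in the prompt, else None."""
    text = prompt.lower()
    for category, words in KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return None


def route(
    prompt: str,
    models: List[ModelProfile],
    classifier: Callable[[str], str],
    w_cost: float = 0.3,
    w_latency: float = 0.2,
    w_accuracy: float = 0.5,
) -> str:
    """Pick a model name: keyword match first, classifier fallback second."""
    category = keyword_category(prompt) or classifier(prompt)

    def score(model: ModelProfile) -> float:
        # Assumed linear trade-off: reward accuracy, penalize cost and latency.
        return (
            w_accuracy * model.accuracy.get(category, 0.0)
            - w_cost * model.cost
            - w_latency * model.latency
        )

    return max(models, key=score).name
```

In this sketch the classifier runs only when no keyword matches, so cheap heuristics handle the obvious prompts and the learned model handles the rest. The weights make the cost, latency, and accuracy trade-off explicit and adjustable.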
Taxonomy
Topics: Mobile Crowdsensing and Crowdsourcing · Advanced Neural Network Applications · Topic Modeling
