SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

Jianchang Su; Yifan Zhang; Shengkai Lin; Shizhen Zhao; Yusheng Zheng; Yiwei Yang; Wei Zhang

arXiv:2601.22397 · cs.LG · February 2, 2026

TL;DR

SAIR is an autoscaling framework for multi-stage ML inference pipelines that uses an LLM as an in-context reinforcement learning controller to jointly reduce tail latency and resource cost, without any offline training.

Contribution

SAIR introduces an LLM-based in-context RL controller that improves its autoscaling policy online, combining Pareto-dominance reward shaping, surprisal-guided experience retrieval, and fine-grained GPU rate control.
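To make the reward-shaping idea concrete, here is a minimal Python sketch of one way a Pareto-dominance reward over (P99 latency, cost) could guarantee a separation margin. It is an illustrative assumption, not SAIR's published formula: the `Outcome` type, the `margin` and `weight` parameters, and the tanh squashing of trade-off moves are all hypothetical choices.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    """Hypothetical observation after an autoscaling action (lower is better for both)."""
    p99_latency_ms: float
    cost: float  # effective resource cost, arbitrary units


def dominates(a: Outcome, b: Outcome) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a.p99_latency_ms <= b.p99_latency_ms and a.cost <= b.cost
    strictly_better = a.p99_latency_ms < b.p99_latency_ms or a.cost < b.cost
    return no_worse and strictly_better


def shaped_reward(new: Outcome, old: Outcome,
                  margin: float = 0.5, weight: float = 0.5) -> float:
    """Hypothetical Pareto-dominance reward (an assumption, not the paper's exact rule).

    Dominating moves score 1 + margin, dominated moves score -(1 + margin), and
    trade-off moves are squashed into (-1, 1) by tanh, so every trade-off is
    separated from both dominance classes by more than `margin`.
    """
    if dominates(new, old):
        return 1.0 + margin
    if dominates(old, new):
        return -(1.0 + margin)
    # Trade-off or tie: weighted relative improvement on each objective.
    d_lat = (old.p99_latency_ms - new.p99_latency_ms) / max(old.p99_latency_ms, 1e-9)
    d_cost = (old.cost - new.cost) / max(old.cost, 1e-9)
    return math.tanh(weight * d_lat + (1.0 - weight) * d_cost)
```

For example, moving from `Outcome(100.0, 10.0)` to `Outcome(80.0, 8.0)` dominates on both axes and earns the full `1 + margin`, while a move that trades 20% latency for 20% extra cost lands at 0. The gap between the two classes is what would let an LLM reading reward-labeled histories tell clear wins from mere trade-offs.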

Findings

01. Achieves up to 50% P99 latency reduction
02. Reduces effective resource cost by up to 97%
03. Detects bottlenecks with 86% accuracy

Abstract

Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97%…
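Surprisal-guided experience retrieval can likewise be sketched. The version below is an assumption for illustration only: it scores each stored interaction by the Gaussian surprisal of its reward relative to the whole history, then fills a small context budget with the most surprising of the nearest neighbors to the current state. `Experience`, `retrieve`, `k`, and `pool` are hypothetical names; the paper's actual surprisal measure and retrieval rule may differ.

```python
from __future__ import annotations

import math
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Experience:
    """Hypothetical reward-labeled interaction stored in the controller's history."""
    state: tuple[float, ...]  # e.g. per-stage utilization features (assumed)
    action: str               # e.g. a textual scaling decision (assumed)
    reward: float


def gaussian_surprisal(x: float, mu: float, sigma: float) -> float:
    """Negative log-likelihood of x under N(mu, sigma^2), in nats."""
    sigma = max(sigma, 1e-6)  # avoid division by zero when all rewards are equal
    return 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))


def retrieve(history: list[Experience], query: tuple[float, ...],
             k: int = 4, pool: int = 16) -> list[Experience]:
    """Hypothetical retrieval: take the `pool` nearest experiences to `query`
    (Euclidean distance), then keep the `k` whose rewards are most surprising."""
    if not history:
        return []
    rewards = [e.reward for e in history]
    mu, sigma = mean(rewards), pstdev(rewards)
    nearest = sorted(history, key=lambda e: math.dist(e.state, query))[:pool]
    return sorted(nearest,
                  key=lambda e: gaussian_surprisal(e.reward, mu, sigma),
                  reverse=True)[:k]
```

The design choice being illustrated is that routine, low-surprisal interactions are dropped first, so the limited LLM context is spent on the examples most likely to change the controller's next decision.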

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Human Pose and Action Recognition