Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI

Samyam Rajbhandari; Mert Hidayetoglu; Aurick Qiao; Ye Wang; Juncheng Yang; Jeff Rasley; Michael Wyatt; Yuxiong He

arXiv:2507.11830 · cs.DC · July 17, 2025

TL;DR

Arctic Inference is an open-source vLLM plugin for enterprise AI inference. Its core technique, Shift Parallelism, is a dynamic parallelism strategy that adapts to real-world traffic. Combined with speculative decoding, SwiftKV compute reduction, and optimized embedding inference, it improves both speed and cost-effectiveness.

Contribution

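The paper presents Shift Parallelism, a dynamic parallelism strategy for AI inference that adapts to real-world traffic, and integrates it with speculative decoding, SwiftKV compute reduction, and optimized embedding inference in the Arctic Inference system.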

Findings

01  Up to 3.4x faster request completion
02  1.75x faster generation
03  1.6M tokens/sec per GPU for embeddings

Abstract

Inference is now the dominant AI workload, yet existing systems force trade-offs between latency, throughput, and cost. Arctic Inference, an open-source vLLM plugin from Snowflake AI Research, introduces Shift Parallelism, a dynamic parallelism strategy that adapts to real-world traffic while integrating speculative decoding, SwiftKV compute reduction, and optimized embedding inference. It achieves up to 3.4 times faster request completion, 1.75 times faster generation, and 1.6M tokens/sec per GPU for embeddings, outperforming both latency- and throughput-optimized deployments. Already powering Snowflake Cortex AI, Arctic Inference delivers state-of-the-art, cost-effective inference for enterprise AI and is now available to the community.
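To make the headline figures concrete, the following is a minimal sketch of the arithmetic behind them. The three constants come from the abstract; the function names, the 10-second baseline, the 1-billion-token corpus, and the assumption that embedding throughput scales linearly with GPU count are all illustrative and not taken from the paper. The completion figure is reported as "up to", so the results are best-case values.

```python
# Figures reported in the abstract (the completion figure is an "up to" value).
COMPLETION_SPEEDUP = 3.4       # request completion speedup
GENERATION_SPEEDUP = 1.75      # generation speedup
EMBED_TOKENS_PER_SEC = 1.6e6   # embedding throughput per GPU


def time_after_speedup(baseline_seconds: float, speedup: float) -> float:
    """Return the time a task takes once it runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return baseline_seconds / speedup


def embedding_seconds(num_tokens: float, gpus: int = 1) -> float:
    """Return seconds needed to embed `num_tokens` at the reported rate.

    Assumes throughput scales linearly with GPU count, which is an
    illustrative simplification and not a claim made by the paper.
    """
    if gpus < 1:
        raise ValueError("gpus must be at least 1")
    return num_tokens / (EMBED_TOKENS_PER_SEC * gpus)


if __name__ == "__main__":
    baseline = 10.0  # hypothetical baseline request time in seconds
    print(f"completion: {time_after_speedup(baseline, COMPLETION_SPEEDUP):.2f} s")
    print(f"generation: {time_after_speedup(baseline, GENERATION_SPEEDUP):.2f} s")
    # Hypothetical corpus of one billion tokens on a single GPU.
    print(f"embedding:  {embedding_seconds(1e9):.0f} s")
```

Under these assumptions, a request that took 10 seconds would complete in roughly 2.9 seconds at best, and a single GPU would embed one billion tokens in 625 seconds.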

Code & Models

Repositories

snowflakedb/arcticinference (PyTorch, official)

Taxonomy

Topics: Scientific Computing and Data Management · Distributed and Parallel Computing Systems · Reservoir Engineering and Simulation Methods