Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving

Hung Cuong Pham; Fatih Gedikli

arXiv:2604.20420·cs.LG·April 23, 2026

TL;DR

This paper evaluates and optimizes a BentoML-based AI inference system for scalable model serving, analyzing performance under realistic workloads and proposing improvements for efficiency and resilience.

Contribution

It provides a comprehensive performance analysis of a BentoML-based inference system under realistic workloads, together with optimization strategies for scalable AI inference.

Findings

1. Optimizations significantly reduce latency and increase throughput.

2. Performance varies with workload intensity and distribution.

3. Deployment in a K3s cluster enhances system resilience.

Abstract

AI research often emphasizes model design and algorithmic performance, while deployment and inference remain comparatively underexplored despite being critical for real-world use. This study addresses that gap by investigating the performance and optimization of a BentoML-based AI inference system for scalable model serving developed in collaboration with graphworks.ai. The evaluation first establishes baseline performance under three realistic workload scenarios. To ensure a fair and reproducible assessment, a pre-trained RoBERTa sentiment analysis model is used throughout the experiments. The system is subjected to traffic patterns following gamma and exponential distributions in order to emulate real-world usage conditions, including steady, bursty, and high-intensity workloads. Key performance metrics, such as latency percentiles and throughput, are collected and analyzed to…
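The workload setup described in the abstract (request traffic drawn from exponential and gamma distributions, summarized by latency percentiles and throughput) can be illustrated with a minimal sketch. This is not the authors' load generator: the function names `interarrival_times` and `latency_summary`, the rate and shape values, and the choice of the 50th, 95th, and 99th percentiles are all assumptions made for illustration. The sketch assumes exponential gaps model steady Poisson-like arrivals and that a gamma shape below 1 models burstier arrivals at the same mean rate.

```python
import random
import statistics


def interarrival_times(n, rate, shape=None, seed=0):
    """Return n gaps (in seconds) between requests, with mean gap 1/rate.

    shape=None draws exponential gaps. A gamma shape below 1 keeps the same
    mean gap but increases its variance, which gives burstier traffic.
    All parameter values here are illustrative, not taken from the paper.
    """
    rng = random.Random(seed)
    if shape is None:
        return [rng.expovariate(rate) for _ in range(n)]
    scale = 1.0 / (rate * shape)  # gamma mean = shape * scale = 1 / rate
    return [rng.gammavariate(shape, scale) for _ in range(n)]


def latency_summary(latencies_ms, duration_s):
    """Summarize measured request latencies and the achieved throughput."""
    # quantiles(n=100) returns 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
        "throughput_rps": len(latencies_ms) / duration_s,
    }


# Hypothetical scenarios: same mean rate, different burstiness.
steady = interarrival_times(20_000, rate=50.0)
bursty = interarrival_times(20_000, rate=50.0, shape=0.25)
```

In a real test, each gap would be slept between requests sent to the serving endpoint, and the recorded per-request latencies would be passed to `latency_summary` along with the wall-clock duration of the run.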
