Scalable AI Inference: Performance Analysis and Optimization of AI Model Serving
Hung Cuong Pham, Fatih Gedikli

TL;DR
This paper evaluates and optimizes a BentoML-based AI inference system for scalable model serving, analyzing performance under realistic workloads and proposing improvements for efficiency and resilience.
Contribution
It provides a comprehensive performance analysis and optimization strategies for scalable AI inference using BentoML in real-world scenarios.
Findings
Optimizations significantly reduce latency and increase throughput.
Performance varies with workload intensity and distribution.
Deployment in a K3s cluster enhances system resilience.
Abstract
AI research often emphasizes model design and algorithmic performance, while deployment and inference remain comparatively underexplored despite being critical for real-world use. This study addresses that gap by investigating the performance and optimization of a BentoML-based AI inference system for scalable model serving developed in collaboration with graphworks.ai. The evaluation first establishes baseline performance under three realistic workload scenarios. To ensure a fair and reproducible assessment, a pre-trained RoBERTa sentiment analysis model is used throughout the experiments. The system is subjected to traffic patterns following gamma and exponential distributions in order to emulate real-world usage conditions, including steady, bursty, and high-intensity workloads. Key performance metrics, such as latency percentiles and throughput, are collected and analyzed to…
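The workload generation and metrics described in the abstract can be illustrated with a minimal sketch. It is not the authors' load generator: the function names, the `rate` and `shape` parameters, and the mapping of distributions to steady or bursty traffic are assumptions made for illustration. Exponential gaps between requests give Poisson arrivals, while gamma gaps with a shape below 1 keep the same mean rate but cluster requests into bursts.

```python
import math
import random

def interarrival_times(n, rate, shape=None, seed=0):
    """Return n inter-arrival gaps in seconds with a mean gap of 1/rate.

    shape=None  -> exponential gaps (Poisson arrivals).
    shape given -> gamma gaps; shape < 1 yields higher variance (bursts).
    All parameter choices here are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)
    if shape is None:
        return [rng.expovariate(rate) for _ in range(n)]
    scale = 1.0 / (rate * shape)  # gamma mean = shape * scale = 1/rate
    return [rng.gammavariate(shape, scale) for _ in range(n)]

def percentile(values, p):
    """Nearest-rank percentile of values, for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def throughput(n_requests, gaps):
    """Requests per second over the span covered by the gaps."""
    return n_requests / sum(gaps)

if __name__ == "__main__":
    steady = interarrival_times(10_000, rate=50.0)
    bursty = interarrival_times(10_000, rate=50.0, shape=0.25)
    for name, gaps in (("steady", steady), ("bursty", bursty)):
        print(name, "offered load (req/s):", round(throughput(len(gaps), gaps), 1))
        print(name, "p99 gap (s):", percentile(gaps, 99))
```

In a real measurement, `percentile` would be applied to recorded per-request latencies (for example p50, p95, and p99) rather than to the gaps themselves, and the gaps would drive the timing of requests sent to the serving endpoint.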
