TL;DR
This paper presents a 3D optimization framework for AI inference scaling that jointly considers accuracy, cost, and latency, enabling more effective and environment-adaptive deployment strategies.
Contribution
It introduces a novel 3D multi-objective optimization approach for inference scaling, addressing limitations of traditional 1D and 2D heuristics, and provides a theoretical foundation for deployment-aware inference.
Findings
Knee-point optimization based on Pareto frontiers achieves the best balance among accuracy, cost, and latency.
Optimal inference scaling allows smaller models to outperform larger ones at lower costs.
The framework adapts to diverse operational conditions for improved deployment efficiency.
Abstract
AI inference scaling is often tuned through 1D heuristics (a fixed number of reasoning passes) or 2D bivariate trade-offs (e.g., accuracy vs. compute), which fail to consider cost and latency constraints. We introduce a 3D optimization framework that jointly calibrates accuracy, cost, and latency within a unified decision space, enabling constraint-aware inference scaling. Using Monte Carlo simulations across three representative scenarios and nine simulated large language models, we evaluate four optimization methods for solving the 3D multi-objective optimization (MOO) problem. Framing inference scaling as an MOO problem shapes a feasible space that 1D and 2D optimizations fail to capture, enabling environment-adaptive selection of the inference scaling. Results show that knee-point optimization based on Pareto frontiers achieves the best balance, while accuracy-maximization remains favorable when…
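To make the idea of knee-point selection on a Pareto frontier concrete, here is a minimal Python sketch. It is not the paper's implementation. The candidate configurations, their (accuracy, cost, latency) values, and the function names are all hypothetical. The knee-point rule used here (the frontier point closest to the ideal corner after min-max normalization) is one common heuristic and is assumed for illustration; the paper may define the knee differently.

```python
from math import sqrt

# Each candidate is (accuracy, cost, latency). Accuracy is maximized;
# cost and latency are minimized. All values are made up for illustration.
candidates = [
    (0.62, 1.0, 0.8),
    (0.74, 2.0, 1.5),
    (0.80, 4.0, 2.9),
    (0.82, 8.0, 5.6),
    (0.78, 6.0, 4.0),
]


def dominates(a, b):
    """Return True if a is no worse than b on every objective and better on at least one."""
    no_worse = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(points):
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def knee_point(front):
    """Assumed heuristic: pick the frontier point nearest the ideal corner
    (best accuracy, lowest cost, lowest latency) in normalized space."""
    lows = [min(p[i] for p in front) for i in range(3)]
    # Guard against a zero range when all points share a value on one axis.
    spans = [(max(p[i] for p in front) - lows[i]) or 1.0 for i in range(3)]

    def distance_to_ideal(p):
        acc, cost, lat = ((p[i] - lows[i]) / spans[i] for i in range(3))
        return sqrt((1.0 - acc) ** 2 + cost ** 2 + lat ** 2)

    return min(front, key=distance_to_ideal)


front = pareto_front(candidates)
knee = knee_point(front)
print("Pareto front:", front)
print("Knee point:", knee)
```

With these made-up numbers, the configuration (0.78, 6.0, 4.0) is dropped because (0.80, 4.0, 2.9) is better on all three objectives, and the knee lands on a mid-range configuration rather than on the accuracy-maximizing one. A 1D or 2D rule that ignores latency or cost could pick a different point from the same candidates.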