TL;DR
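Dooly is a profiling tool for large language model (LLM) inference that lowers profiling cost by exploiting the structure of operation inputs: each input dimension is either fixed by the model configuration or determined by the incoming request. Because configuration-fixed values recur across models, one sweep over the request-dependent dimensions yields configuration-agnostic latency estimates.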
Contribution
Dooly introduces a novel structure-aware profiling method that enables single-pass, configuration-agnostic latency estimation for diverse LLM inference workloads.
Findings
Achieves a mean absolute percentage error (MAPE) within 5% for time to first token (TTFT) and within 8% for time per output token (TPOT).
Reduces profiling GPU-hours by 56.4% across 12 models.
Works across multiple GPU platforms and attention backends.
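For readers unfamiliar with the accuracy metric above, here is a minimal Python sketch of MAPE as it is commonly defined. The function name `mape` and the sample latencies are illustrative assumptions, not taken from the paper, and the paper may aggregate errors differently.

```python
def mape(predicted, measured):
    """Mean absolute percentage error, in percent.

    Assumes both sequences have equal length and every measured value is nonzero.
    """
    if len(predicted) != len(measured) or len(measured) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Hypothetical latencies in milliseconds, for illustration only.
estimated_ttft = [105.0, 190.0]
measured_ttft = [100.0, 200.0]
error_percent = mape(estimated_ttft, measured_ttft)
```

Under this definition, a result within 5% means the estimates deviate from the measured latencies by at most 5% on average.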
Abstract
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making exploration prohibitively expensive. This cost stems from a missing structural understanding: every input dimension of each operation is fixed by the model configuration or determined by the incoming request. Many model-configuration values (e.g., head size, layer count) recur across models, so the same operation runs in many configurations; a single sweep over the request-dependent dimensions can serve them all. We present Dooly, which exploits this structure to achieve configuration-agnostic, redundancy-aware…
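The redundancy argument in the abstract can be illustrated with a small Python sketch. It is not Dooly's implementation: the model names, dimension values, operation set, and the helpers `operation_keys` and `plan_profiling` are all hypothetical, chosen only to show how operations that share configuration-fixed dimensions could be grouped so that each group is swept once.

```python
from collections import defaultdict

# Hypothetical model configurations; the values are illustrative, not from the paper.
MODELS = {
    "model_a": {"head_size": 128, "num_heads": 32, "hidden_size": 4096},
    "model_b": {"head_size": 128, "num_heads": 32, "hidden_size": 4096},
    "model_c": {"head_size": 128, "num_heads": 40, "hidden_size": 5120},
}


def operation_keys(cfg):
    """Return (operation name, configuration-fixed dimensions) pairs for one model.

    The two operations listed here are an assumed, simplified operation set.
    """
    qkv_out = 3 * cfg["num_heads"] * cfg["head_size"]
    return [
        ("attention", (cfg["num_heads"], cfg["head_size"])),
        ("qkv_proj", (cfg["hidden_size"], qkv_out)),
    ]


def plan_profiling(models):
    """Map each distinct operation key to the models that reuse it."""
    plan = defaultdict(list)
    for name, cfg in models.items():
        for key in operation_keys(cfg):
            plan[key].append(name)
    return dict(plan)


plan = plan_profiling(MODELS)

# Sweeps needed when every model is profiled from scratch.
naive_sweeps = sum(len(operation_keys(cfg)) for cfg in MODELS.values())
# Sweeps needed when identical operation keys are profiled once and shared.
shared_sweeps = len(plan)
```

In this toy setup, `model_a` and `model_b` share both operation keys, so a sweep over request-dependent dimensions (for example, batch size and sequence length) would be run once per distinct key and reused by every model that maps to it.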
