TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition
Prabhu Vellaisamy, Shreesh Tripathi, Vignesh Natarajan, Surya Santhan Thenarasu, Shawn Blanton, John P. Shen

TL;DR
TaxBreak is a trace-driven methodology that decomposes host-side overheads in LLM inference, helping identify whether to optimize the software stack or device-side execution to reduce latency.
Contribution
It introduces TaxBreak, a trace-driven decomposition approach, and the Host-Device Balance Index (HDBI) for diagnosing host and device overheads in LLM inference.
Findings
MoE models dispatch 8-11x more kernels per output token than dense models.
A faster host CPU reduces orchestration overhead by 10-29%.
TaxBreak helps determine whether to optimize the software stack or device-side execution.
Abstract
Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and…
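The three-way decomposition described in the abstract can be illustrated with a minimal Python sketch. It assumes per-step timings are already extracted from a trace; the `HostOverhead` record, its field names, and the `balance_ratio` helper are hypothetical and not taken from the paper. In particular, the abstract does not give the HDBI formula, so the bounded ratio below is only a placeholder showing how device-active time could be related to host-visible orchestration time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class HostOverhead:
    """Hypothetical per-step record of host-visible orchestration time (microseconds)."""

    framework_us: float     # framework translation time
    cuda_library_us: float  # CUDA library translation time
    launch_path_us: float   # kernel launch-path time

    def total_us(self) -> float:
        """Sum of the three components named in the abstract."""
        return self.framework_us + self.cuda_library_us + self.launch_path_us

    def shares(self) -> dict[str, float]:
        """Fraction of total host overhead attributed to each component."""
        total = self.total_us()
        parts = {
            "framework": self.framework_us,
            "cuda_library": self.cuda_library_us,
            "launch_path": self.launch_path_us,
        }
        if total == 0:
            return {name: 0.0 for name in parts}
        return {name: value / total for name, value in parts.items()}


def balance_ratio(device_active_us: float, host: HostOverhead) -> float:
    """Placeholder bounded ratio in [0, 1]; NOT the paper's HDBI definition.

    Values near 1 mean device-active time dominates; values near 0 mean
    host-visible orchestration dominates.
    """
    denom = device_active_us + host.total_us()
    return device_active_us / denom if denom > 0 else 0.0


# Illustrative numbers only, not measurements from the paper.
step = HostOverhead(framework_us=30.0, cuda_library_us=10.0, launch_path_us=10.0)
print(step.total_us())
print(step.shares())
print(balance_ratio(150.0, step))
```

With this kind of breakdown, a large share for one component points at the layer to examine first, while the ratio indicates whether host-side or device-side work bounds the step.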
Taxonomy
Topics: Software System Performance and Reliability · IoT and Edge/Fog Computing · Parallel Computing and Optimization Techniques
