TaxBreak: Unmasking the Hidden Costs of LLM Inference Through Overhead Decomposition

Prabhu Vellaisamy; Shreesh Tripathi; Vignesh Natarajan; Surya Santhan Thenarasu; Shawn Blanton; John P. Shen

arXiv:2603.12465·cs.DC·March 16, 2026

TL;DR

TaxBreak is a trace-driven methodology that decomposes host-side overheads in LLM inference, helping identify whether to optimize the software stack or device-side execution to improve latency.

Contribution

The paper introduces TaxBreak, a decomposition approach, and the Host-Device Balance Index (HDBI) for diagnosing host and device overheads in LLM inference.

Findings

01

MoE models dispatch 8-11x more kernels per output token than dense models.

02

A faster host CPU reduces orchestration overhead by 10-29%.

03

TaxBreak helps determine whether to optimize the software stack or device-side execution.

Abstract

Large Language Model (LLM) inference is widely used in interactive assistants and agentic systems. In latency-sensitive deployments, inference time can become dominated by host-side overheads. Existing approaches typically expose this cost only as an aggregate residual or a launch/queue metric, which is often insufficient to identify which execution layer should be optimized. This work presents TaxBreak, a trace-driven methodology for decomposing host-visible orchestration overhead into three components: framework translation time, CUDA library translation time, and kernel launch-path time. We validate TaxBreak on NVIDIA H100 and H200 systems and use it to derive our proposed Host-Device Balance Index (HDBI), a boundedness summary index that relates device-active execution to host-visible orchestration. Across representative dense and mixture-of-experts workloads in both prefill and…
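The three-way decomposition described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the `StepTrace` class, its field names, the microsecond units, and especially `balance_index`, which is a simple bounded ratio of device-active time to total time. The abstract does not give the formula for HDBI, so this ratio should not be read as the paper's definition.

```python
from dataclasses import dataclass


@dataclass
class StepTrace:
    """Hypothetical per-step timings in microseconds (not the paper's trace schema)."""
    framework_us: float      # framework translation time
    cuda_library_us: float   # CUDA library translation time
    launch_path_us: float    # kernel launch-path time
    device_active_us: float  # time the GPU spends executing kernels

    def host_overhead_us(self) -> float:
        """Total host-visible orchestration time: the sum of the three components."""
        return self.framework_us + self.cuda_library_us + self.launch_path_us

    def host_shares(self) -> dict[str, float]:
        """Fraction of host overhead attributed to each component."""
        total = self.host_overhead_us()
        parts = {
            "framework": self.framework_us,
            "cuda_library": self.cuda_library_us,
            "launch_path": self.launch_path_us,
        }
        if total == 0:
            return {name: 0.0 for name in parts}
        return {name: value / total for name, value in parts.items()}

    def balance_index(self) -> float:
        """Assumed illustrative ratio in [0, 1]: device / (device + host).

        Values near 1 suggest a device-bound step, values near 0 a host-bound one.
        This is NOT the paper's HDBI formula, which the abstract does not state.
        """
        total = self.device_active_us + self.host_overhead_us()
        return self.device_active_us / total if total > 0 else 0.0


# Made-up numbers, purely to exercise the functions.
step = StepTrace(framework_us=30.0, cuda_library_us=10.0,
                 launch_path_us=60.0, device_active_us=300.0)
print(step.host_overhead_us())
print(step.host_shares())
print(step.balance_index())
```

A per-component breakdown of this kind is what distinguishes the approach from an aggregate residual: the largest share in `host_shares()` points at the layer worth optimizing first.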

Taxonomy

Topics: Software System Performance and Reliability · IoT and Edge/Fog Computing · Parallel Computing and Optimization Techniques