LLM-CoOpt: A Co-Design and Optimization Framework for Efficient LLM Inference on Heterogeneous Platforms

Jie Kong; Wei Wang; Jiehan Zhou; Chen Yu

arXiv:2602.09323·cs.DC·February 11, 2026

TL;DR

LLM-CoOpt is a co-design framework that improves large language model inference efficiency by optimizing memory access, computation, and long-sequence processing, raising throughput by up to 13.43% and cutting latency by up to 16.79%.

Contribution

The paper introduces an integrated framework that combines KV cache optimization (Opt-KV), grouped-query attention (Opt-GQA), and long-sequence processing strategies for efficient LLM inference.

Findings

01. Inference throughput increased by up to 13.43%.

02. Latency reduced by up to 16.79%.

03. Model accuracy is maintained during optimization.

Abstract

Frequent memory bandwidth bottlenecks, computational redundancy, and inefficiencies in long-sequence processing remain major challenges in LLM inference. To address these issues, we propose LLM-CoOpt, a comprehensive algorithm-hardware co-design framework aimed at improving both throughput and latency in LLM inference. LLM-CoOpt integrates three key strategies: (1) Key-Value Cache Optimization, termed Opt-KV, which improves memory access efficiency by optimizing both the KV cache write and read paths, and introduces FP8 quantization to reduce memory footprint while maintaining accuracy; (2) Grouped-Query Attention for Computational Efficiency, termed Opt-GQA, which reduces overall computational complexity by restructuring multi-head self-attention into grouped-query attention with shared key-value projections, enabling higher throughput and lower resource consumption; (3) Paged…
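
To make the memory argument behind strategies (1) and (2) concrete, the following is a minimal back-of-the-envelope sketch, not the paper's implementation. It estimates KV cache size under three configurations: standard multi-head attention in FP16, grouped-query attention in FP16, and grouped-query attention with an FP8 cache. The model shape (32 layers, 32 query heads, head dimension 128), the choice of 8 key-value heads, and the helper names `kv_cache_bytes` and `kv_head_for_query` are all assumptions chosen for illustration; per-tensor scale factors that an FP8 scheme would typically store are ignored.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    """Bytes needed to hold keys and values for every layer and cached token."""
    # Factor of 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


def kv_head_for_query(query_head, num_query_heads, num_kv_heads):
    """Index of the shared key-value head that a query head reads in grouped-query attention."""
    if num_query_heads % num_kv_heads != 0:
        raise ValueError("num_query_heads must be a multiple of num_kv_heads")
    group_size = num_query_heads // num_kv_heads
    return query_head // group_size


# Hypothetical model shape, for illustration only (not taken from the paper).
LAYERS, QUERY_HEADS, HEAD_DIM = 32, 32, 128
KV_HEADS = 8            # assumed number of shared key-value heads
SEQ_LEN, BATCH = 4096, 1
FP16_BYTES, FP8_BYTES = 2, 1

mha_fp16 = kv_cache_bytes(LAYERS, QUERY_HEADS, HEAD_DIM, SEQ_LEN, BATCH, FP16_BYTES)
gqa_fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, BATCH, FP16_BYTES)
gqa_fp8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN, BATCH, FP8_BYTES)

MIB = 1024 * 1024
print(f"MHA, FP16 cache: {mha_fp16 // MIB} MiB")
print(f"GQA, FP16 cache: {gqa_fp16 // MIB} MiB")
print(f"GQA, FP8 cache:  {gqa_fp8 // MIB} MiB")
```

Under these assumed numbers, sharing key-value heads shrinks the cache by the ratio of query heads to key-value heads (4x here), and storing entries in one byte instead of two halves it again. Since every decoded token reads the whole cache, a smaller cache also means less memory traffic per step, which is the bandwidth bottleneck the abstract names.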

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Machine Learning and Algorithms