H2EAL: Hybrid-Bonding Architecture with Hybrid Sparse Attention for Efficient Long-Context LLM Inference

Zizhuo Fu; Xiaotian Guo; Wenxuan Zeng; Shuzhang Zhong; Yadong Zhang; Peiyu Chen; Runsheng Wang; Le Ye; Meng Li

arXiv:2508.16653·cs.PF·December 9, 2025

TL;DR

H2EAL is a hybrid-bonding accelerator co-designed with a hybrid sparse attention algorithm. It targets long-context LLM inference at the edge and reduces the energy and latency overhead of that workload.

Contribution

The paper introduces a hybrid sparse attention scheme, combined with hardware co-design and load-balancing strategies, for efficient edge inference of large language models.

Findings

01. Achieves up to 48.21x speedup over baseline

02. Improves energy efficiency by up to 73.48x

03. Maintains negligible accuracy drop of 0.87%

Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in a wide range of natural language processing applications. However, the high energy and latency overhead induced by the KV cache limits edge deployment, especially for long contexts. Emerging hybrid bonding (HB) technology has been proposed as a promising alternative to conventional near-memory processing (NMP) architectures, offering improved bandwidth efficiency and lower power consumption while exhibiting characteristics of distributed memory. In this paper, we propose H2EAL, a hybrid bonding-based accelerator with sparse attention algorithm-hardware co-design for efficient LLM inference at the edge. At the algorithm level, we propose a hybrid sparse attention scheme with static and dynamic sparsity for different heads to fully leverage the sparsity with high accuracy. At the hardware level, we co-design the…
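As a rough illustration of what "static and dynamic sparsity for different heads" could look like, the sketch below treats one decoding step for a single head. It is not the authors' implementation: the abstract does not say which static pattern or which dynamic selection rule H2EAL uses, so the sketch assumes a static head keeps a few initial tokens plus a recent window, and a dynamic head keeps the top-k keys by query-key score. All function names and parameters (`static_keep`, `dynamic_keep`, `n_sink`, `window`, `k`) are hypothetical.

```python
import math

def static_keep(seq_len, n_sink=2, window=4):
    """Static sparsity (assumed pattern): first n_sink tokens plus a recent window."""
    recent_start = max(n_sink, seq_len - window)
    return list(range(min(n_sink, seq_len))) + list(range(recent_start, seq_len))

def dynamic_keep(scores, k=4):
    """Dynamic sparsity (assumed rule): the k keys with the largest scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)

def sparse_attention(q, keys, values, keep):
    """Softmax attention for one query, restricted to the kept key indices."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, keys[i])) for i in keep]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in logits]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, keep)) / total
            for d in range(dim)]

def hybrid_attention(q, keys, values, head_kind, n_sink=2, window=4, k=4):
    """Pick the kept keys according to the head's (hypothetical) kind, then attend."""
    if head_kind == "static":
        keep = static_keep(len(keys), n_sink, window)
    else:
        scores = [sum(a * b for a, b in zip(q, key)) for key in keys]
        keep = dynamic_keep(scores, k)
    return keep, sparse_attention(q, keys, values, keep)
```

In this sketch a static head reads a fixed, predictable slice of the KV cache, while a dynamic head reads a data-dependent subset. The dynamic variant here still scores every key in full, so it only shows the selection step, not how a real system would make that selection cheap.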
