EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens

Chaoqun Yang; Xinyu Lin; Wenjie Wang; Yongqi Li; Teng Sun; Xianjing Han; Tat-Seng Chua

arXiv:2507.00715·cs.IR·July 2, 2025

Open Access

TL;DR

EARN is an inference acceleration framework for LLM-based generative recommendation that compresses interaction history into register tokens, delivering up to 3.79x faster inference and up to 80.8% less KV Cache while preserving accuracy.

Contribution

The paper introduces EARN, a method that uses layer-wise attention patterns to compress interaction-history information into register tokens, enabling faster inference in LLM-based generative recommendation (LLMRec) systems.

Findings

01 Achieves up to 3.79x inference speedup

02 Reduces KV Cache by up to 80.8%

03 Achieves higher accuracy than finetuning methods

Abstract

Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency due to massive computational overhead and memory pressure of KV Cache. Existing KV Cache reduction methods face critical limitations: cache compression offers marginal acceleration given recommendation tasks' short decoding steps, while prompt compression risks discarding vital interaction history. Through systematic analysis of attention patterns in LLMRec, we uncover two pivotal insights: 1) layer-wise attention sparsity inversion where early layers retain dense informative patterns while later layers exhibit high redundancy, and 2) dual attention sinks phenomenon where attention scores concentrate on both head and tail tokens of input sequences. Motivated by these insights, we propose EARN, an efficient inference framework that leverages the early…
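The second insight in the abstract, dual attention sinks, can be made concrete with a small measurement. The following is a minimal sketch, not the paper's code: the helper names `softmax` and `sink_mass`, the window sizes, and the synthetic logits are all assumptions chosen for illustration. It computes how much of one query's attention falls on the first and last key positions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sink_mass(attn_row, k_head, k_tail):
    """Fraction of one query's attention on the first k_head and last k_tail keys.

    Hypothetical helper for illustration; attn_row is assumed to sum to 1.
    """
    n = len(attn_row)
    if k_head < 0 or k_tail < 0 or k_head + k_tail > n:
        raise ValueError("head and tail windows must fit inside the row")
    head = sum(attn_row[:k_head])
    tail = sum(attn_row[n - k_tail:])
    return head + tail


# Synthetic logits (an assumption, not measured data): the first and last
# positions score much higher than the middle, mimicking a head/tail sink.
logits = [4.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 4.0]
row = softmax(logits)
mass = sink_mass(row, k_head=1, k_tail=1)
print(f"attention mass on head+tail tokens: {mass:.3f}")
```

On real models, the same measurement would be applied to each layer's attention rows, for example those returned by a framework that exposes attention weights, to check whether mass concentrates at both ends of the input as the abstract describes.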

Taxonomy

Topics: Recommender Systems and Techniques · Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare