VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling

Kaiyuan Li; Yongxiang Tang; Yanhua Cheng; Yong Bai; Yanxiang Zeng; Chao Wang; Xialong Liu; Peng Jiang

arXiv:2508.17125·cs.IR·August 26, 2025

TL;DR

VQL introduces an efficient, context-aware vector quantization attention method for ultra-long user behavior sequences, improving recommendation accuracy while reducing inference latency in large-scale systems.

Contribution

The paper presents VQL, a new attention framework with key-only quantization, multi-scale codebooks, and efficient context injection, balancing compression, context-awareness, and efficiency.

Findings

01. VQL outperforms baselines on three large datasets.

02. It achieves higher accuracy with reduced inference latency.

03. It establishes a new state of the art in ultra-long sequence modeling.

Abstract

In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to latency and memory constraints. Existing solutions fall into two categories: (1) top-k retrieval, which truncates the sequence and may discard most attention mass when L >> k; and (2) encoder-based compression, which preserves coverage but often over-compresses and fails to incorporate key context such as temporal gaps or target-aware signals. Neither class achieves a good balance of low-loss compression, context awareness, and efficiency. We propose VQL, a context-aware Vector Quantization Attention framework for ultra-long behavior modeling, with three innovations. (1) Key-only quantization: only attention keys are quantized, while values remain…
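
The abstract's remark that top-k retrieval may discard most attention mass when L >> k can be illustrated with a small sketch. It is not taken from the paper: the function names, the Gaussian-distributed scores, and the values L = 10000 and k = 100 are all illustrative assumptions. It simply measures how much softmax attention weight the k highest-scoring positions retain.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_attention_mass(scores, k):
    """Fraction of softmax attention mass kept by the k highest-scoring positions."""
    weights = softmax(scores)
    return sum(sorted(weights, reverse=True)[:k])


# Hypothetical setup: scores drawn from a standard normal distribution.
# Real attention scores in a recommender may be distributed very differently.
random.seed(0)
L, k = 10000, 100
scores = [random.gauss(0.0, 1.0) for _ in range(L)]

mass = topk_attention_mass(scores, k)
print(f"top-{k} of {L} positions keep {mass:.1%} of the attention mass")
```

Under these assumed scores, the retained fraction stays well below one half, which is the regime the abstract describes. With sharply peaked scores the top-k positions would retain far more, so how much mass is lost depends on the score distribution.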
