VQL: An End-to-End Context-Aware Vector Quantization Attention for Ultra-Long User Behavior Modeling
Kaiyuan Li, Yongxiang Tang, Yanhua Cheng, Yong Bai, Yanxiang Zeng, Chao Wang, Xialong Liu, Peng Jiang

TL;DR
VQL is a context-aware vector quantization attention method for ultra-long user behavior sequences that improves recommendation accuracy while reducing inference latency in large-scale systems.
Contribution
The paper presents VQL, a new attention framework with key-only quantization, multi-scale codebooks, and efficient context injection, balancing compression, context-awareness, and efficiency.
Findings
VQL outperforms baselines on three large datasets.
Achieves higher accuracy with reduced inference latency.
Establishes a new state of the art in ultra-long sequence modeling.
Abstract
In large-scale recommender systems, ultra-long user behavior sequences encode rich signals of evolving interests. Extending sequence length generally improves accuracy, but directly modeling such sequences in production is infeasible due to latency and memory constraints. Existing solutions fall into two categories: (1) top-k retrieval, which truncates the sequence and may discard most attention mass when L >> k; and (2) encoder-based compression, which preserves coverage but often over-compresses and fails to incorporate key context such as temporal gaps or target-aware signals. Neither class achieves a good balance of low-loss compression, context awareness, and efficiency. We propose VQL, a context-aware Vector Quantization Attention framework for ultra-long behavior modeling, with three innovations. (1) Key-only quantization: only attention keys are quantized, while values remain…
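The abstract's point that top-k retrieval "may discard most attention mass when L >> k" can be illustrated numerically. The sketch below is not from the paper: it assumes attention scores drawn from a standard normal distribution (an arbitrary choice for illustration), and the helper name topk_attention_mass is hypothetical. It measures what fraction of the softmax attention mass the k highest-scoring positions retain as the sequence length L grows.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def topk_attention_mass(scores, k):
    """Fraction of softmax attention mass kept by the k highest-scoring positions.

    Hypothetical helper for illustration; not part of VQL.
    """
    weights = softmax(scores)
    return sum(sorted(weights, reverse=True)[:k])


# Assumption: scores are i.i.d. standard normal, chosen only to make the trend visible.
random.seed(0)
k = 50
for L in (100, 1_000, 10_000):
    scores = [random.gauss(0.0, 1.0) for _ in range(L)]
    print(f"L={L:>6}  k={k}  retained mass={topk_attention_mass(scores, k):.3f}")
```

Under this assumption the retained mass shrinks as L grows with k fixed, which is the coverage loss that the compression-based approaches described in the abstract aim to avoid. Real attention score distributions may be more peaked, in which case top-k loses less.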
