Enhancing Performance and Scalability of Large-Scale Recommendation Systems with Jagged Flash Attention

Rengan Xu; Junjie Yang; Yifan Xu; Hong Li; Xing Liu; Devashish Shankar; Haoci Zhang; Meng Liu; Boyang Li; Yuxi Hu; Mingwei Tang; Zehua Zhang; Tunhou Zhang; Dai Li; Sijia Chen; Gian-Paolo Musumeci; Jiaqi Zhai; Bill Zhu; Hong Yan; Srihari Reddy

arXiv:2409.15373·cs.LG·September 25, 2024

TL;DR

This paper introduces Jagged Flash Attention, a novel method that significantly improves the efficiency and scalability of large-scale recommendation systems by optimizing attention mechanisms for variable-length categorical features.

Contribution

We develop Jagged Feature Interaction Kernels and integrate them with Flash Attention to handle dynamic tensor sizes, achieving substantial speedups and memory savings in recommendation models.
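A jagged tensor is commonly stored as one flat values buffer plus an offsets array marking where each variable-length row begins and ends. The sketch below shows that layout in plain Python and compares its footprint with a dense layout padded to the longest row. The helper names (`to_jagged`, `jagged_row`, `padded_size`) and the sample data are illustrative assumptions; this is not the paper's kernel code.

```python
from typing import List, Tuple


def to_jagged(rows: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Flatten variable-length rows into (values, offsets).

    Row i occupies values[offsets[i]:offsets[i + 1]].
    """
    values: List[int] = []
    offsets: List[int] = [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def jagged_row(values: List[int], offsets: List[int], i: int) -> List[int]:
    """Recover row i from the flat buffer without touching any padding."""
    return values[offsets[i]:offsets[i + 1]]


def padded_size(rows: List[List[int]]) -> int:
    """Slots a dense layout needs when every row is padded to the longest."""
    return len(rows) * max((len(r) for r in rows), default=0)


# Hypothetical categorical feature: one ID list per sample, lengths differ.
rows = [[3, 7], [5], [1, 2, 8, 9, 4], []]
values, offsets = to_jagged(rows)

print(values)             # flat buffer holding only real IDs
print(offsets)            # row boundaries into the flat buffer
print(len(values), padded_size(rows))  # jagged slots vs. padded slots
```

In this toy batch the jagged buffer holds 8 entries where the padded layout would hold 20; the gap grows with the spread between typical and maximum row length.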

Findings

01. Up to 9x speedup over dense attention
02. 22x memory reduction compared to dense attention
03. 10% QPS improvement in production models

Abstract

The integration of hardware accelerators has significantly advanced the capabilities of modern recommendation systems, enabling the exploration of complex ranking paradigms previously deemed impractical. However, the GPU-based computational costs present substantial challenges. In this paper, we demonstrate our development of an efficiency-driven approach to explore these paradigms, moving beyond traditional reliance on native PyTorch modules. We address the specific challenges posed by ranking models' dependence on categorical features, which vary in length and complicate GPU utilization. We introduce Jagged Feature Interaction Kernels, a novel method designed to extract fine-grained insights from long categorical features through efficient handling of dynamically sized tensors. We further enhance the performance of attention mechanisms by integrating Jagged tensors with Flash…
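To make the idea of attention over jagged inputs concrete, here is a minimal CPU sketch, assuming single-head self-attention where each token attends only to tokens in its own variable-length segment. It uses the streaming (online) softmax that Flash Attention is known for, so no full score row is stored and no padding is ever processed. All function names and the sample data are hypothetical, and this plain-Python loop only illustrates the arithmetic; it is not the authors' GPU kernel.

```python
import math
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def online_softmax_attention(q: Vector, keys: List[Vector],
                             vals: List[Vector]) -> Vector:
    """Attention output for one query, streaming over keys one at a time.

    Keeps a running max (m), running normalizer (l), and running weighted
    sum (acc), rescaling them whenever a larger score appears.
    """
    if not keys:
        return []
    scale = 1.0 / math.sqrt(len(q))
    m, l = -math.inf, 0.0
    acc = [0.0] * len(vals[0])
    for k, v in zip(keys, vals):
        s = dot(q, k) * scale
        m_new = max(m, s)
        correction = math.exp(m - m_new)  # 0.0 on the first step
        p = math.exp(s - m_new)
        l = l * correction + p
        acc = [a * correction + p * x for a, x in zip(acc, v)]
        m = m_new
    return [a / l for a in acc]


def naive_attention(q: Vector, keys: List[Vector],
                    vals: List[Vector]) -> Vector:
    """Reference: materialize all scores, then softmax and average."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [dot(q, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, vals)) / total
            for d in range(len(vals[0]))]


def jagged_self_attention(x: List[Vector], offsets: List[int]) -> List[Vector]:
    """Self-attention per segment of a jagged (values, offsets) input."""
    out: List[Vector] = []
    for start, end in zip(offsets, offsets[1:]):
        seg = x[start:end]          # empty segments contribute nothing
        for q in seg:
            out.append(online_softmax_attention(q, seg, seg))
    return out


# Hypothetical batch: segments of length 2, 0, and 1 in one flat buffer.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0]]
offsets = [0, 2, 2, 3]
out = jagged_self_attention(tokens, offsets)
print(out)
```

The output stays in the same flat layout as the input, so the work is proportional to the sum of squared segment lengths rather than the batch size times the squared maximum length that a padded dense layout would imply.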

Taxonomy

Methods: Softmax · Attention Is All You Need