Efficient user history modeling with amortized inference for deep learning recommendation models

Lars Hertel; Neil Daftary; Fedor Borisyuk; Aman Gupta; Rahul Mazumder

arXiv:2412.06924·cs.LG·December 11, 2024

TL;DR

This paper improves deep learning recommendation models by using amortized inference with a novel user history modeling approach, significantly reducing latency while maintaining recommendation quality.

Contribution

It introduces a reformulation of the M-FALCON amortized inference algorithm for DLRMs and demonstrates its effectiveness in reducing latency in real-world deployment.

Findings

1. Amortized inference reduces latency by 30% in LinkedIn applications.

2. Appending candidate items with cross-attention performs on par with concatenation.

3. Reformulating M-FALCON for DLRMs enables efficient user history modeling.

Abstract

We study user history modeling via Transformer encoders in deep learning recommendation models (DLRMs). Such architectures can significantly improve recommendation quality, but usually incur high latency costs, necessitating infrastructure upgrades or very small Transformer models. An important part of user history modeling is early fusion of the candidate item, and various methods have been studied. We revisit early fusion and compare concatenating the candidate to each history item against appending it to the end of the list as a separate item. The latter method allows us to reformulate the recently proposed amortized history inference algorithm M-FALCON (Zhai et al., 2024) for the case of DLRMs. We show via experimental results that appending with cross-attention performs on par with concatenation and that amortization significantly reduces inference costs. We conclude…
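To make the amortization idea concrete, here is a minimal toy sketch in plain Python; it is not the authors' implementation. It assumes a single attention head with no learned projections, feed-forward layers, or residuals, and it uses a masked self-attention in which each appended candidate attends to the full history and to itself, standing in for the cross-attention mentioned in the abstract. All function names (`attend`, `encode`, `amortized_mask`, and so on) are hypothetical. The point it illustrates: when candidates are appended as separate items and masked so they cannot see each other, one pass over the history plus all candidates gives the same candidate outputs as one pass per candidate.

```python
import math
import random


def attend(queries, keys, values, mask):
    """Scaled dot-product attention; mask[i][j] is True where query i may see key j."""
    dim = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) if mask[i][j] else -math.inf
            for j, k in enumerate(keys)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # masked positions get weight 0.0
        total = sum(weights)
        out.append([
            sum(w * v[c] for w, v in zip(weights, values)) / total
            for c in range(len(values[0]))
        ])
    return out


def encode(seq, mask, layers=2):
    """Toy encoder (assumption): a stack of bare attention layers, nothing else."""
    for _ in range(layers):
        seq = attend(seq, seq, seq, mask)
    return seq


def causal_mask(n):
    return [[j <= i for j in range(n)] for i in range(n)]


def amortized_mask(n_hist, n_cand):
    """History rows are causal; candidate row i sees all history plus itself only."""
    size = n_hist + n_cand
    return [
        [(j <= i) if i < n_hist else (j < n_hist or j == i) for j in range(size)]
        for i in range(size)
    ]


def score_separately(history, candidates):
    """Baseline: one encoder pass per candidate, candidate appended at the end."""
    n = len(history)
    return [encode(history + [c], causal_mask(n + 1))[-1] for c in candidates]


def score_amortized(history, candidates):
    """One encoder pass over the history with all candidates appended."""
    n = len(history)
    return encode(history + candidates, amortized_mask(n, len(candidates)))[n:]


random.seed(0)
dim = 4
history = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]
candidates = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]

separate = score_separately(history, candidates)
amortized = score_amortized(history, candidates)
max_gap = max(
    abs(a - b) for row_a, row_b in zip(separate, amortized) for a, b in zip(row_a, row_b)
)
print("largest difference between the two scoring paths:", max_gap)
```

The equivalence holds in this sketch because history rows never attend to candidate rows, so the history representation is computed once and shared. Concatenating the candidate to every history item would instead make each history row depend on the candidate, which rules out this kind of sharing.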

Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Generative Adversarial Networks and Image Synthesis

Methods: Attention Is All You Need · Adam · Dropout · Position-Wise Feed-Forward Layer · Softmax · Dense Connections · Byte Pair Encoding · Linear Layer · Multi-Head Attention · Label Smoothing