Efficient Recommendation with Millions of Items by Dynamic Pruning of Sub-Item Embeddings

Aleksandr V. Petrov; Craig Macdonald; Nicola Tonellotto

arXiv:2505.00560·cs.IR·May 2, 2025

TL;DR

This paper introduces RecJPQPrune, a dynamic pruning algorithm that speeds up inference for recommenders with large catalogues by finding the highest-scored items without scoring every item in the catalogue, reducing latency without loss of effectiveness.

Contribution

The paper presents RecJPQPrune, a dynamic pruning method that is safe up to rank K, enabling fast inference over millions of items without approximation and without relying on GPUs.

Findings

01. Reduces median scoring time by 64x on large datasets.

02. Scores 2 million items in under 10 ms without GPUs.

03. Maintains recommendation effectiveness with guaranteed top-K safety.

Abstract

A large item catalogue is a major challenge for deploying modern sequential recommender models, since it makes the memory footprint of the model large and increases inference latency. One promising approach to address this is RecJPQ, which replaces item embeddings with sub-item embeddings. However, slow inference remains problematic because finding the top highest-scored items usually requires scoring all items in the catalogue, which may not be feasible for large catalogues. By adapting dynamic pruning concepts from document retrieval, we propose the RecJPQPrune dynamic pruning algorithm to efficiently find the top highest-scored items without computing the scores of all items in the catalogue. Our RecJPQPrune algorithm is safe-up-to-rank K since it theoretically guarantees that no potentially high-scored item is excluded from the final top K recommendation list, thereby ensuring no…
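
The idea of safe pruning over sub-item scores can be illustrated with a minimal sketch. It is not the paper's RecJPQPrune algorithm, whose details are not given in the text above; it only shows, under simplifying assumptions, how an upper bound on unscored items lets a top-K search stop early without changing the result. Assumptions: each item is described by one sub-item id per split, an item's score is the sum of its sub-item scores, and all names (`codes`, `sub_scores`, `pruned_top_k`, `exhaustive_top_k`) are hypothetical.

```python
import heapq


def score_item(item_codes, sub_scores):
    # Assumed scoring rule: item score = sum of its sub-item scores, one per split.
    return sum(sub_scores[m][c] for m, c in enumerate(item_codes))


def exhaustive_top_k(codes, sub_scores, k):
    # Baseline: score every item in the catalogue, then keep the k best.
    scored = [(score_item(ic, sub_scores), item) for item, ic in enumerate(codes)]
    return heapq.nlargest(k, scored)


def pruned_top_k(codes, sub_scores, k):
    # Illustrative safe pruning (assumes k >= 1); returns (top-k list, items scored).
    num_splits = len(sub_scores)

    # Inverted index: (split, sub-item id) -> items that use that sub-item id.
    postings = {}
    for item, ic in enumerate(codes):
        for m, c in enumerate(ic):
            postings.setdefault((m, c), []).append(item)

    # Sub-item ids of each split, sorted by descending sub-item score.
    order = [sorted(range(len(s)), key=s.__getitem__, reverse=True)
             for s in sub_scores]
    ptr = [0] * num_splits  # next unprocessed position in each split
    heap, seen = [], set()  # min-heap of (score, item); items scored so far

    # If one split is fully processed, every item has already been scored.
    while all(ptr[m] < len(order[m]) for m in range(num_splits)):
        # Best unprocessed sub-item score per split. An unscored item has only
        # unprocessed sub-item ids, so its score cannot exceed sum(heads).
        heads = [sub_scores[m][order[m][ptr[m]]] for m in range(num_splits)]
        if len(heap) == k and heap[0][0] > sum(heads):
            break  # no unscored item can enter the top k

        # Process the highest-scored unprocessed sub-item id across all splits.
        m = max(range(num_splits), key=heads.__getitem__)
        c = order[m][ptr[m]]
        ptr[m] += 1
        for item in postings.get((m, c), ()):
            if item not in seen:
                seen.add(item)
                entry = (score_item(codes[item], sub_scores), item)
                if len(heap) < k:
                    heapq.heappush(heap, entry)
                else:
                    heapq.heappushpop(heap, entry)

    return sorted(heap, reverse=True), len(seen)


if __name__ == "__main__":
    import random

    random.seed(0)
    num_items, num_splits, ids_per_split = 5000, 4, 16
    codes = [tuple(random.randrange(ids_per_split) for _ in range(num_splits))
             for _ in range(num_items)]
    sub_scores = [[random.gauss(0.0, 1.0) for _ in range(ids_per_split)]
                  for _ in range(num_splits)]
    top, n_scored = pruned_top_k(codes, sub_scores, 10)
    print("same result as exhaustive scoring:",
          top == exhaustive_top_k(codes, sub_scores, 10))
    print("items scored:", n_scored, "of", num_items)
```

Because the loop stops only when the K-th best score found so far strictly exceeds the bound on every unscored item, the sketch returns the same list as exhaustive scoring. How many items it skips depends on the score distribution, so it says nothing about the speed-ups reported for RecJPQPrune.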

Taxonomy

Methods: Linear Layer · Multi-Head Attention · Dense Connections · Adam · Attention Is All You Need · Dropout · Pruning · Layer Normalization · Position-Wise Feed-Forward Layer · Byte Pair Encoding