SRBench: A Comprehensive Benchmark for Sequential Recommendation with Large Language Models
Jianhong Li, Zeheng Qian, Wangze Ni, Haoyang Li, Hongwei Yao, Yang Bai, Kui Ren

TL;DR
SRBench is a benchmark for evaluating sequential recommendation models, especially LLM-based ones, on dimensions that matter in real-world use: accuracy, fairness, stability, and efficiency.
Contribution
It introduces a multi-dimensional evaluation framework, a unified prompt-based input paradigm, and a novel answer extraction mechanism for fair comparison of SR models.
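To make the prompt-based input and answer extraction ideas concrete, here is a minimal sketch. It is not SRBench's actual prompt template or extraction mechanism, which this summary does not describe. The function names `build_prompt` and `extract_answer`, the lettered-option format, and the fallback to title matching are all assumptions made for illustration.

```python
import re
from typing import Optional, Sequence

# Hypothetical pattern: "Answer: B", "answer is (B)", etc. The option letter
# itself is matched case-sensitively so ordinary words are not mistaken for it.
_LETTER = re.compile(r"(?i:\banswer\b\s*(?:is\b)?\s*:?\s*)\(?([A-Z])\)?(?![A-Za-z])")


def build_prompt(history: Sequence[str], candidates: Sequence[str]) -> str:
    """Render an interaction history and a candidate set (at most 26) as one prompt."""
    history_text = "\n".join(f"{i}. {title}" for i, title in enumerate(history, 1))
    options = "\n".join(
        f"({chr(ord('A') + i)}) {title}" for i, title in enumerate(candidates)
    )
    return (
        "The user interacted with these items, in order:\n"
        f"{history_text}\n"
        "Which candidate is the user most likely to interact with next?\n"
        f"{options}\n"
        "Answer with the letter of one candidate."
    )


def extract_answer(output: str, candidates: Sequence[str]) -> Optional[str]:
    """Map free-form model text to one candidate title, or None if unclear."""
    # Step 1: prefer an explicit option letter that follows the word "answer".
    match = _LETTER.search(output)
    if match:
        index = ord(match.group(1)) - ord("A")
        if 0 <= index < len(candidates):
            return candidates[index]
    # Step 2: fall back to title matching, accepting only an unambiguous hit.
    lowered = output.lower()
    hits = [title for title in candidates if title.lower() in lowered]
    return hits[0] if len(hits) == 1 else None


if __name__ == "__main__":
    options = ["Alien", "Heat", "Big"]
    print(build_prompt(["Up", "Jaws"], options))
    print(extract_answer("I think the answer is (B).", options))
```

Returning `None` for ambiguous or unmatched output keeps unparseable responses separate from wrong recommendations, so a benchmark built this way could report them as their own failure category.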
Findings
LLM-SR models focus too heavily on item popularity.
SRBench enables fair, multi-dimensional assessment of SR models.
Evaluation of 13 models reveals insights into LLM-SR capabilities.
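One simple way to quantify a popularity focus like the one reported above is to measure how many recommendations fall among the most frequently interacted items. The sketch below is an illustrative assumption, not the metric SRBench uses; the function name `popularity_share` and the 20% head cutoff are hypothetical choices.

```python
from collections import Counter
from typing import Iterable, Sequence


def popularity_share(
    recommendations: Iterable[str],
    training_interactions: Sequence[str],
    head_fraction: float = 0.2,
) -> float:
    """Fraction of recommendations drawn from the most popular items.

    "Most popular" means the top `head_fraction` of distinct items, ranked
    by how often they appear in the training interactions.
    """
    counts = Counter(training_interactions)
    ranked = [item for item, _ in counts.most_common()]
    head_size = max(1, int(len(ranked) * head_fraction))
    head = set(ranked[:head_size])

    recs = list(recommendations)
    if not recs:
        raise ValueError("no recommendations to score")
    return sum(rec in head for rec in recs) / len(recs)


if __name__ == "__main__":
    interactions = ["a"] * 5 + ["b"] * 3 + ["c", "d", "e"]
    print(popularity_share(["a", "a", "b", "c"], interactions))
```

A share well above `head_fraction` would suggest that a model favors popular items more than a popularity-blind recommender would.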
Abstract
LLM development has aroused great interest in Sequential Recommendation (SR) applications. However, comprehensive evaluation of SR models remains lacking due to the limitations of the existing benchmarks: 1) an overemphasis on accuracy, ignoring other real-world demands (e.g., fairness); 2) existing datasets fail to unleash LLMs' potential, leading to unfair comparison between Neural-Network-based SR (NN-SR) models and LLM-based SR (LLM-SR) models; and 3) no reliable mechanism for extracting task-specific answers from unstructured LLM outputs. To address these limitations, we propose SRBench, a comprehensive SR benchmark with three core designs: 1) a multi-dimensional framework covering accuracy, fairness, stability and efficiency, aligned with practical demands; 2) a unified input paradigm via prompt engineering to boost LLM-SR performance and enable fair comparisons between models; 3)…
