Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders
Yupeng Hou, Jiacheng Li, Xiangjun Fu, Zhankui He, An Yan, Xiusi Chen, Julian McAuley

TL;DR
This paper introduces BLaIR, a benchmark for evaluating large language models (LLMs) as semantic encoders of textual item features in recommendation and product search.
Contribution
The paper contributes a large-scale dataset (Amazon Reviews 2023, with over 570 million reviews and 48 million items), a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and a new complex-query product search task.
Findings
LLM rankings on BLaIR differ significantly from their rankings on general-purpose embedding benchmarks.
The new benchmark reveals unique challenges in semantic encoding for recommendation.
Experiments with 11 LLMs demonstrate varied performance across tasks.
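One common way to quantify how much two leaderboards disagree is a rank correlation. The sketch below computes Spearman's rho between two orderings of the same models. The model names and both orderings are made-up placeholders, not results from the paper, and the summary above does not say which statistic the authors use; this is only an illustration of the idea.

```python
def spearman_rho(order_a: list[str], order_b: list[str]) -> float:
    """Spearman rank correlation between two orderings of the same items (no ties)."""
    if sorted(order_a) != sorted(order_b):
        raise ValueError("both orderings must contain the same items")
    n = len(order_a)
    if n < 2:
        raise ValueError("need at least two items")
    # Position of each item in the second ordering.
    rank_b = {name: pos for pos, name in enumerate(order_b)}
    # Sum of squared rank differences.
    d_squared = sum((pos - rank_b[name]) ** 2 for pos, name in enumerate(order_a))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


# Hypothetical placeholder leaderboards (best model first); not data from the paper.
general_purpose_order = ["model_a", "model_b", "model_c", "model_d", "model_e"]
recommendation_order = ["model_c", "model_a", "model_e", "model_b", "model_d"]

rho = spearman_rho(general_purpose_order, recommendation_order)
print(f"Spearman rho: {rho:.2f}")
```

A value near 1 means the two benchmarks rank models almost identically; a value near 0 means that performing well on one says little about the other.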
Abstract
Feature engineering has long been central to recommender systems, yet effectively leveraging textual item features remains challenging. Recent advances in large language models (LLMs) have enabled their use as semantic encoders for recommendation, but their roles and behaviors in this setting are still not well understood. Prior studies often rely on general-purpose embedding benchmarks (e.g., MTEB) when selecting LLMs, overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. We contribute (1) a new large-scale Amazon Reviews 2023 dataset with over 570 million reviews and 48 million items, (2) a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and (3) a new complex-query product search task…
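To make the "semantic encoder" setting concrete, here is a minimal sketch of embedding-based product search: the query and each item's text are mapped to vectors, and items are ranked by cosine similarity. The `toy_encode` function is a hypothetical bag-of-words stand-in for an LLM encoder so that the example runs without downloading a model; the item texts and identifiers are also invented for illustration. It is not the paper's pipeline.

```python
import math


def build_vocab(texts: list[str]) -> dict[str, int]:
    """Assign each distinct lowercase token an index."""
    vocab: dict[str, int] = {}
    for text in texts:
        for token in text.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def toy_encode(text: str, vocab: dict[str, int]) -> list[float]:
    """Hypothetical stand-in for an LLM encoder: L2-normalized bag of words."""
    vec = [0.0] * len(vocab)
    for token in text.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def rank_items(query: str, items: dict[str, str]) -> list[tuple[str, float]]:
    """Rank items by cosine similarity between query and item embeddings."""
    vocab = build_vocab(list(items.values()) + [query])
    q = toy_encode(query, vocab)
    scored = [
        (item_id, sum(a * b for a, b in zip(q, toy_encode(text, vocab))))
        for item_id, text in items.items()
    ]
    # Highest similarity first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented example catalog; not taken from the dataset.
items = {
    "item_a": "wireless noise cancelling headphones",
    "item_b": "stainless steel water bottle",
    "item_c": "wired gaming headphones with microphone",
}
ranking = rank_items("noise cancelling headphones", items)
for item_id, score in ranking:
    print(item_id, round(score, 3))
```

In a real setup, `toy_encode` would be replaced by a pretrained text encoder, and the same ranking step would apply to its output vectors. Which encoder to plug in is the question a benchmark of this kind is meant to inform.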
Code & Models
- 🤗 hyp1231/blair-roberta-base (model) · 2.5k downloads · ♡ 3
- 🤗 hyp1231/blair-roberta-large (model) · 1.8k downloads · ♡ 2
- 🤗 hyp1231/blair-games-roberta-base (model) · 10 downloads
- 🤗 hyp1231/blair-games-roberta-large (model) · 9 downloads
- 🤗 innerCircuit/llama3-sentiment-Cell-Phones-Accessories-3class-baseline-150k (model) · 2 downloads
- 🤗 innerCircuit/llama3-sentiment-Cell-Phones-Accessories-3class-sequential-150k (model)
- 🤗 innerCircuit/llama3-sentiment-Cell-Phones-Accessories-binary-baseline-150k (model) · 3 downloads
- 🤗 innerCircuit/llama3-sentiment-Electronics-binary-baseline-150k (model) · 2 downloads
- 🤗 innerCircuit/llama3-sentiment-All-Beauty-binary-baseline-150k (model) · 1 download
- McAuley-Lab/Amazon-Reviews-2023 (dataset) · 52k downloads
- McAuley-Lab/Amazon-C4 (dataset) · 201 downloads
- cogsci13/Amazon-Reviews-2023-Books-Review (dataset) · 5.1k downloads
- cogsci13/Amazon-Reviews-2023-Books-Meta (dataset) · 32k downloads
- milistu/AMAZON-Products-2023 (dataset) · 535 downloads
- milistu/AMAZON-Products-2023-Arabic (dataset) · 182 downloads
- smartcat/Amazon_All_Beauty_2018 (dataset) · 70 downloads
- smartcat/Amazon_Sports_and_Outdoors_2018 (dataset) · 238 downloads
