Bridging Language and Items for Retrieval and Recommendation: Benchmarking LLMs as Semantic Encoders

Yupeng Hou; Jiacheng Li; Xiangjun Fu; Zhankui He; An Yan; Xiusi Chen; Julian McAuley

arXiv:2403.03952·cs.IR·April 21, 2026·25 cites

TL;DR

This paper introduces BLaIR, a comprehensive benchmark for evaluating large language models as semantic encoders in recommendation tasks, addressing the unique challenges of textual feature utilization in recommender systems.

Contribution

The paper presents a new large-scale dataset, a unified benchmark across multiple recommendation tasks, and a complex-query product search evaluation, highlighting the distinct challenges of using LLMs as semantic encoders for recommendation.

Findings

01

LLMs' rankings on BLaIR differ significantly from their rankings on general-purpose embedding benchmarks (e.g., MTEB).

02

The new benchmark reveals unique challenges in semantic encoding for recommendation.

03

Experiments with 11 LLMs demonstrate varied performance across tasks.

Abstract

Feature engineering has long been central to recommender systems, yet effectively leveraging textual item features remains challenging. Recent advances in large language models (LLMs) have enabled their use as semantic encoders for recommendation, but their roles and behaviors in this setting are still not well understood. Prior studies often rely on general-purpose embedding benchmarks (e.g., MTEB) when selecting LLMs, overlooking the unique characteristics of recommendation tasks. To address this gap, we introduce BLaIR, a comprehensive benchmark for evaluating LLMs as semantic encoders in recommendation scenarios. We contribute (1) a new large-scale Amazon Reviews 2023 dataset with over 570 million reviews and 48 million items, (2) a unified benchmark covering sequential recommendation, collaborative filtering, and product search, and (3) a new complex-query product search task…
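To make the "semantic encoder" setting concrete, here is a minimal sketch of embedding-based product search: item texts and a query are mapped into the same vector space and items are ranked by cosine similarity. Everything here is an illustrative assumption rather than the paper's method. In particular, `toy_encode` is a hashed bag-of-words stand-in for an LLM encoder, and `DIM`, `rank_items`, the item IDs and the catalog texts are all hypothetical.

```python
import hashlib
import math
import re

DIM = 256  # hypothetical embedding size for this toy example


def toy_encode(text: str) -> list[float]:
    """Stand-in for an LLM encoder (assumption): hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        # Map each token to a fixed bucket so the encoding is deterministic.
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


def rank_items(query: str, items: dict[str, str]) -> list[tuple[str, float]]:
    """Rank items by cosine similarity between query and item embeddings."""
    q = toy_encode(query)
    scores = {
        item_id: sum(a * b for a, b in zip(q, toy_encode(text)))
        for item_id, text in items.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical catalog; a real setup would use item metadata from a dataset.
catalog = {
    "item_a": "wireless noise cancelling over-ear headphones",
    "item_b": "stainless steel chef knife",
    "item_c": "paperback mystery novel",
}

ranking = rank_items("wireless noise cancelling headphones", catalog)
for item_id, score in ranking:
    print(item_id, round(score, 3))
```

Swapping `toy_encode` for a real text-embedding model would leave the ranking logic unchanged, which is why encoder choice can be benchmarked in isolation under a setup like this.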

Code & Models

Repositories

hyp1231/amazonreviews2023 (GitHub)
