SemSR: Semantics aware robust Session-based Recommendations

Jyoti Narwariya; Priyanka Gupta; Muskan Gupta; Jyotsana Khatri; Lovekesh Vig

arXiv:2508.20587·cs.IR·August 29, 2025


TL;DR

This paper introduces SemSR, a semantics-aware approach that leverages Large Language Models to enhance session-based recommendation systems by combining semantic understanding with traditional methods, improving overall recommendation quality.

Contribution

The paper proposes novel methods to incorporate LLMs into SR models, including in-context recommendation, semantic initialization, and hybrid integration, demonstrating significant performance improvements.

Findings

01 LLM-based methods excel at coarse-level retrieval with high recall.

02 Traditional data-driven models perform better at fine-grained ranking with high MRR.

03 Hybrid models outperform standalone LLM and data-driven approaches in both recall and MRR.


Tables

Table 1. Statistics of the datasets used for experiments.

Datasets	#train	#test	#items	Avg. |s|
Amazon-M2 (UK)	1172181	115936	499611	4.12
Amazon-Beauty	290512	21580	54615	6.40

Table 2. Evaluation on Amazon-M2 (UK) and Beauty datasets. We report overall performance in terms of Recall@20, MRR@20, Recall@100 and MRR@100. Bold numbers are for the best.

	Amazon-M2 (UK)				Amazon-Beauty
Method	R@20	MRR@20	R@100	MRR@100	R@20	MRR@20	R@100	MRR@100
NARM	32.49	17.00	42.55	17.26	15.68	4.49	29.36	4.82
SRGNN	41.77	23.54	51.67	23.54	19.67	6.14	34.59	6.05
MSGAT	38.86	21.41	49.95	21.69	21.55	6.41	38.45	6.83
Angle embeddings with BERT
SemMSGAT-I	46.00	14.90	63.49	15.33	15.30	3.93	32.06	4.32
SemMSGAT-I+	46.00	25.82	63.49	27.25	15.30	5.77	32.06	7.08
SemMSGAT-F	51.55	27.36	65.81	27.73	22.80	7.44	41.22	7.88
SemMSGAT-F+	51.55	27.37	65.81	27.48	22.80	7.66	41.22	7.94
Llama 3 embeddings
SemMSGAT-I	47.48	16.46	65.59	16.91	18.00	4.80	37.23	5.26
SemMSGAT-I+	47.48	26.15	65.59	27.45	18.00	6.19	37.23	7.32
SemMSGAT-F	52.66	26.12	68.82	26.53	24.07	6.93	43.64	7.40
SemMSGAT-F+	52.66	27.42	68.82	27.52	24.07	7.72	43.64	7.85
NISER	49.84	26.45	61.77	26.76	20.84	6.95	38.35	7.36
Angle embeddings with BERT
SemNISER-I	46.96	14.25	64.44	14.68	19.62	5.07	37.00	5.49
SemNISER-I+	46.96	26.06	64.44	27.33	19.62	6.94	37.00	7.66
SemNISER-F	54.98	26.41	69.90	26.79	22.67	6.10	42.41	6.57
SemNISER-F+	54.98	27.71	69.90	27.57	22.67	7.33	42.41	7.63
Llama 3 embeddings
SemNISER-I	50.05	16.47	67.62	16.91	19.95	4.80	40.23	5.28
SemNISER-I+	50.05	26.85	67.62	27.57	19.95	6.54	40.23	7.38
SemNISER-F	54.26	25.19	69.87	25.59	20.95	5.42	40.36	5.89
SemNISER-F+	54.26	27.66	69.87	27.60	20.95	7.24	40.36	7.56

Table 3. LLM as RS comparison with data-driven SR methods on a subset of the Amazon-M2 (UK) dataset and on the Beauty dataset.

	Amazon-M2 (UK)				Amazon-Beauty
Method	R@20	MRR@20	R@100	MRR@100	R@20	MRR@20	R@100	MRR@100
FS-LLM	31.83	9.92	47.50	10.30	7.07	1.90	14.48	2.07
ZCoT-LLM	25.16	7.65	39.29	7.99	7.12	1.82	15.08	1.99
FSCoT-LLM	31.70	10.00	47.97	10.44	7.04	1.86	14.32	1.89
MSGAT	26.28	14.90	34.74	15.09	21.55	6.41	38.46	6.83
SemMSGAT-I	44.28	14.76	61.09	15.18	18.00	4.80	37.23	5.26
SemMSGAT-F	47.64	21.55	63.79	21.96	24.07	6.93	43.64	7.40
NISER	38.02	20.21	47.56	20.46	20.84	6.95	38.35	7.36
SemNISER-I	46.64	14.77	62.97	15.18	19.95	4.80	40.23	5.28
SemNISER-F	47.17	20.83	63.83	21.25	20.95	5.42	40.36	5.89

Equations

\hat{y}_{k} = \frac{\exp ( \textbf{i}_{k}^{T} \textbf{s} )}{\sum _{j = 1}^{n} \exp ( \textbf{i}_{j}^{T} \textbf{s} )} .


Taxonomy

Topics: Recommender Systems and Techniques · Topic Modeling · Machine Learning in Healthcare

Full text

SemSR: Semantics aware robust Session-based Recommendations

Jyoti Narwariya, TCS Research, New Delhi, India

Priyanka Gupta, TCS Research, New Delhi, India

Muskan Gupta, TCS Research, New Delhi, India

Jyotsana Khatri, TCS Research, Pune, India

Lovekesh Vig, TCS Research, New Delhi, India

(2025)

Abstract.

Session-based recommendation (SR) models aim to recommend items to anonymous users based on their behavior during the current session. While various SR models in the literature utilize item sequences to predict the next item, they often fail to leverage semantic information from item titles or descriptions, which impedes session intent identification and interpretability. Recent research has explored Large Language Models (LLMs) as promising approaches to enhance session-based recommendations, with both prompt-based and fine-tuning-based methods being widely investigated. However, prompt-based methods struggle to identify optimal prompts that elicit correct reasoning and lack task-specific feedback at test time, resulting in suboptimal recommendations. Fine-tuning methods incorporate domain-specific knowledge but incur significant computational costs for implementation and maintenance. In this paper, we present multiple approaches to utilize LLMs for session-based recommendation: (i) in-context LLMs as recommendation agents, (ii) LLM-generated representations for semantic initialization of deep learning SR models, and (iii) integration of LLMs with data-driven SR models. Through comprehensive experiments on two real-world publicly available datasets, we demonstrate that LLM-based methods excel at coarse-level retrieval (high recall values), while traditional data-driven techniques perform well at fine-grained ranking (high Mean Reciprocal Rank values). Furthermore, the integration of LLMs with data-driven SR models significantly outperforms both standalone LLM approaches and data-driven deep learning models, as well as baseline SR models, in terms of both Recall and MRR metrics.

Keywords: Session-based Recommendation, Large Language Models

Conference: Workshop on Evaluating and Applying Recommendation Systems with Large Language Models (EARL ’25) at RecSys ’25; September 22–26, 2025; Prague, Czech Republic. Journal year: 2025. Copyright: ACM licensed. CCS concepts: Information systems → Recommender systems.

* These authors contributed equally to this work.

1. Introduction

Session-based Recommendation (SR) models are becoming increasingly popular due to their ability to recommend items to anonymous users based only on interactions in the current session. Several SR models employing deep learning architectures have been proposed in the literature (Hidasi et al., 2015; Wu et al., 2019; Gupta et al., 2021, 2019; Hou et al., 2022; Kang and McAuley, 2018; Li et al., 2017; Liu et al., 2018; Xie et al., 2022). Recently, motivated by the notable achievements of LLMs, several works have tried to employ these large models in recommendation (Hu et al., 2024; Liu et al., 2024; Wang et al., 2024; Liu et al., 2025). SAID (Hu et al., 2024) aims to learn item embeddings that align with the textual descriptions of items. The authors propose a two-stage training scheme. In the first stage, SAID employs a projector module to transform an item ID into an embedding and feeds it into an LLM to explicitly elicit the item’s textual token sequence from the LLM. In the second stage, the embeddings are used to extract the entire sequence’s representation for recommendation. LLM-ESR (Liu et al., 2024) obtains semantic embeddings of items and users by encoding prompt texts with LLMs. The authors devise a dual-view modeling framework that combines semantic and collaborative information. Specifically, the embeddings derived from LLMs are frozen to avoid a loss of semantics. Next, they propose a retrieval-augmented self-distillation method to enhance the sequence encoder of an SR model using similar users. AlphaFuse (Hu et al., 2025) learns ID embeddings in the null space of language embeddings to combine semantic and collaborative knowledge.

In this work, we also aim to utilize the capability of LLMs in SR models efficiently and effectively. We propose SemSR (Semantics aware robust Session-based Recommendations), a framework that incorporates LLM capability in the form of item embeddings generated by a pre-existing LLM. Our approach differs from the approaches above in that SemSR is trained end to end, unlike SAID, which has two stages of learning, and LLM-ESR, which uses a self-distillation method to enhance the sequence encoder of an SR model.

Through extensive experiments on two real-world publicly available datasets, we demonstrate that LLM-based methods excel at coarse-level retrieval (high recall values), while traditional data-driven techniques perform well at fine-grained ranking (high Mean Reciprocal Rank values). Furthermore, SemSR significantly outperforms both standalone LLM approaches and data-driven deep learning models, as well as baseline SR models, in terms of both Recall and MRR metrics.

We employ two distinct SR models to show the efficacy of our approach: MSGAT (Qiao et al., 2023) and NISER (Gupta et al., 2019). However, the approach is model-agnostic and can be applied to any SR model from the literature.

2. Related Work

Session-based Recommendation: SR methods have evolved from traditional techniques such as Markov Chains (Jamali and Ester, 2010; He and McAuley, 2016) and collaborative filtering (Ekstrand et al., 2011; Wang et al., 2019) to advanced deep learning-based approaches. Early methods struggled to capture complex user behaviors and sequential dependencies, while deep learning models such as GRU4Rec (Hidasi et al., 2015), SASRec (Kang and McAuley, 2018), and SRGNN (Wu et al., 2019) have significantly improved predictive accuracy by leveraging sequential modelling, self-attention (Vaswani et al., 2017), and graph-based representations (Wang et al., 2021; Gupta et al., 2019, 2024b, 2024a; Qiao et al., 2023).

LLM as Recommender Systems (RS): LLMs have recently demonstrated strong capabilities in natural language understanding, reasoning and beyond (Zhao et al., 2023; Min et al., 2023). The idea of directly utilizing LLMs as RS has gained significant traction with models such as GPT (Brown et al., 2020), BERT (Devlin et al., 2019), and LLaMA (Touvron et al., 2023), which are trained on vast corpora of text and therefore capture rich semantics. Due to their strong understanding of language and context, LLMs can generate more personalized and context-aware recommendations than conventional models. Recent works (Geng et al., 2022; Dai et al., 2023; Sanner et al., 2023; Ji et al., 2024) have demonstrated that prompt-based LLMs can effectively recommend items by leveraging in-context learning. While prompt-based and fine-tuned LLM-based techniques show strong potential, they are sensitive to prompt design and often lack up-to-date knowledge of the data, which can lead to irrelevant recommendations in dynamic environments like e-commerce.

LLMs to augment Recommender Systems: As more LLMs have been developed, research has progressively explored how to utilize the knowledge in LLMs to improve sequential recommendation. Recently, several works have highlighted the effectiveness of LLMs as components in recommendation tasks (Xi et al., 2024; He et al., 2023; Guo et al., 2024; Qiao et al., 2024; Liu et al., 2024; Hu et al., 2024; Wang et al., 2024; Liu et al., 2025). KAR (Xi et al., 2024) proposes to use language embeddings as additional input when learning ID embeddings. LLM4SBR (Qiao et al., 2024) uses a two-step strategy. First, session data is transformed into both textual and behavioral modalities, allowing LLMs to infer session intent from textual descriptions. Second, SR models use behavioral data to align and average session representations across the two modalities. LLM-ESR (Liu et al., 2024) retrieves semantic embeddings of items and users by encoding prompts with LLMs and uses a retrieval-augmented self-distillation method to enhance the sequence encoder of an SR model. SAID (Hu et al., 2024), on the other hand, uses a two-stage training process: the first stage generates item embeddings by leveraging the projector module and the LLM, and in the second stage the learned item embeddings are input into the sequential model to extract the entire sequence’s representation for recommendation. In contrast to the above methods, our proposed method SemSR introduces an end-to-end framework that directly incorporates LLM-generated item embeddings into the SR model to obtain top- $K$ recommendations. AlphaFuse (Hu et al., 2025) injects collaborative signals into the null space of language embeddings, which helps to preserve the semantic information; its trainable ID embeddings are learned in an orthogonal null space. Our approach differs: it fuses semantic signals with trainable collaborative signals, and the entire model is trained jointly.

3. Proposed Framework

This section will introduce the problem definition and details of our approach:

3.1. Problem Definition

Suppose that $\mathcal{S}$ denotes the set of all sessions in the logged data containing user-item interactions (e.g. click/view/order), and $\mathcal{I}$ denotes the set of $n$ items observed in $\mathcal{S}$ . Any session $s\in\mathcal{S}$ is a sequence of item-click events: $s=(i_{s,1},i_{s,2},\ldots,i_{s,|s|})$ , where $i_{s,j}$ ( $j=1\ldots|s|$ ) $\in$ $\mathcal{I}$ , denotes the $j^{th}$ clicked item in session $s$ . The goal of SR is to predict the next item $i_{s,|s|+1}$ as the target class in an $n$ -way classification problem by estimating the $n$ -dimensional item-probability vector $\mathbf{\hat{y}}_{s,|s|+1}$ corresponding to the relevance scores for the $n$ items. The $K$ items with the highest scores constitute the top- $K$ recommendations.

3.2. LLM as Recommender System

We leverage an LLM (Llama3-8B-Instruct; https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) as an in-context text generation method that generates the title of the next recommended item for a given session, where the session is represented as a sequence of clicked item titles. We use different in-context prompt design strategies, such as few-shot in-context prompting and chain-of-thought (CoT) prompting, to generate the next recommended title.

Prompt Design: The prompting strategies share a common structure: we first define the role of the system, then define the task for the model, and finally obtain the response title from the LLM. The prompting strategies are as follows:

• Few-shot in-context prompt (FS-LLM): In addition to the role and task definitions, we provide a few input-output examples to the LLM, followed by the query session, and let the model predict the next item title. The prompt template for FS-LLM is shown in Figure 2.

• Zero-shot Chain-of-Thought prompt (ZCoT-LLM): We design a two-step chain-of-thought prompt. We first generate a rationale and then generate recommendations based on the generated rationale.

• Few-shot Chain-of-Thought prompt (FSCoT-LLM): This strategy follows ZCoT-LLM, but we also provide some examples in the prompt so that the model can learn the relation between inputs and outputs.

LLM Inference: To retrieve top-k recommendations, we use a vector database (Chromadb; https://docs.trychroma.com/docs/overview/getting-started) to store item embeddings obtained from the Llama3 model. We then retrieve the top-k recommendations based on the cosine similarity between the generated next-item title and the items stored in the database.
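The retrieval step can be illustrated with a minimal sketch. It assumes the generated title has already been embedded, and it replaces the vector database with a plain dictionary; the function names `cosine` and `top_k_by_cosine` and the toy vectors are hypothetical and not taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_by_cosine(query_vec, item_vecs, k):
    """Return the ids of the k catalogue items most similar to query_vec.

    item_vecs maps item id -> embedding. A dict stands in for the vector
    database here (an assumption made to keep the sketch self-contained).
    """
    ranked = sorted(
        item_vecs,
        key=lambda item_id: cosine(query_vec, item_vecs[item_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy 3-dimensional embeddings (hypothetical values, not from the paper).
catalogue = {
    "shampoo": [0.9, 0.1, 0.0],
    "conditioner": [0.8, 0.3, 0.1],
    "phone case": [0.0, 0.2, 0.9],
}
generated_title_vec = [1.0, 0.2, 0.0]  # embedding of the LLM-generated title
recommendations = top_k_by_cosine(generated_title_vec, catalogue, k=2)
```

A real deployment would delegate the nearest-neighbour search to the vector store rather than sorting the whole catalogue per query.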

3.3. SemSR

The overall architecture of SemSR is shown in Figure 1. The following section specifies the implementation details of the architecture.

3.3.1. Semantic-aware Item and Session Representation

Each item is mapped to a $d_{1}$ -dimensional vector from the trainable embedding look-up table to obtain interaction-based item embeddings $\textbf{I}_{m}=[\textbf{i}_{m_{1}},\textbf{i}_{m_{2}},\dots,\textbf{i}_{m_{n}}]^{T}\in\mathbb{R}^{n\times d_{1}}$ such that each row $\textbf{i}_{m_{j}}\in\mathbb{R}^{d_{1}}$ is the $d_{1}$ -dimensional embedding vector corresponding to item $i_{j}\in\mathcal{I}$ $(j=1,2,...,n)$ . Further, each item is mapped to a $d_{2}$ -dimensional vector by inputting item features such as brand, category, price, color, title, and description to an LLM (in this work, we consider Angle (Li and Li, 2023) and LLM2Vec (BehnamGhader et al., 2024) embeddings, with BERT and Llama-3.1-8B as the respective backbone models) to obtain LLM-based item embeddings $\textbf{I}_{l}=[\textbf{i}_{l_{1}},\textbf{i}_{l_{2}},...,\textbf{i}_{l_{n}}]^{T}\in\mathbb{R}^{n\times d_{2}}$ . These embeddings are frozen during model training.

We denote the data-driven (interaction-based) session embedding by $\textbf{s}_{m}$ and the LLM-based session embedding by $\textbf{s}_{l}$ . Consider any function $f$ (e.g., the neural network of an SR model from the literature (Gupta et al., 2019; Qiao et al., 2023)), parameterized by $\theta$ , that maps the sequence of items in session $s$ to $\textbf{s}_{m}=f(\textbf{I}_{m,s},\theta)$ , where $\textbf{I}_{m,s}=[\textbf{i}_{m,s,1},\textbf{i}_{m,s,2},...,\textbf{i}_{m,s,|s|}]^{T}\in\mathbb{R}^{|s|\times d_{1}}$ . To obtain $\textbf{s}_{l}$ , we compute the soft-attention weight of the $j$ -th item in session $s$ as $\alpha_{j}=\textbf{q}^{T}\mathrm{sigmoid}(\textbf{W}_{1}\textbf{i}_{l,s,|s|}+\textbf{W}_{2}\textbf{i}_{l,s,j}+\textbf{c})$ , where $j=1,2,...,|s|-1$ , $\textbf{q},\textbf{c}\in\mathbb{R}^{d_{2}}$ , $\textbf{W}_{1},\textbf{W}_{2}\in\mathbb{R}^{d_{2}\times d_{2}}$ , and $\textbf{i}_{l,s,|s|}$ is the embedding of the most recent item in session $s$ . The $\alpha_{j}$ ’s are then normalized using a softmax operation, and the intermediate session embedding is $\textbf{s}^{\prime}=\sum_{j=1}^{|s|-1}\alpha_{j}\textbf{i}_{l,s,j}$ . The LLM-based session embedding $\textbf{s}_{l}$ is a linear transformation of the concatenation of the intermediate session embedding $\textbf{s}^{\prime}$ and the embedding of the most recent item $\textbf{i}_{l,s,|s|}$ , i.e., $\textbf{s}_{l}=\textbf{W}_{3}[\textbf{s}^{\prime};\textbf{i}_{l,s,|s|}]$ , where $\textbf{W}_{3}\in\mathbb{R}^{d_{2}\times 2d_{2}}$ .

Finally, the semantic-aware session embedding is obtained by a linear transformation of the concatenation of $\textbf{s}_{m}$ and $\textbf{s}_{l}$ , $\textbf{s}=\textbf{W}_{4}[\textbf{s}_{m};\textbf{s}_{l}]$ , where $\textbf{W}_{4}\in\mathbb{R}^{d\times(d_{1}+d_{2})}$ . Similarly, the semantic-aware item embeddings are computed as a linear transformation of the concatenation of $\textbf{I}_{m}$ and $\textbf{I}_{l}$ , $\textbf{I}=\textbf{W}_{5}[\textbf{I}_{m};\textbf{I}_{l}]$ , where $\textbf{W}_{5}\in\mathbb{R}^{d\times(d_{1}+d_{2})}$ . The semantic-aware item and session embeddings are then used to obtain the relevance score for the next clicked item $i_{k}$ , computed as
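The soft-attention computation of $\textbf{s}_{l}$ can be sketched in NumPy as follows. This is an illustrative reading of the formulas in this subsection, not the authors' code; the function name `llm_session_embedding`, the toy shapes, and the toy values are assumptions, and the sketch requires a session with at least two items.

```python
import numpy as np

def llm_session_embedding(items, W1, W2, W3, q, c):
    """LLM-based session embedding s_l from frozen item embeddings.

    items: array of shape (|s|, d2), rows ordered by click time.
    Assumes |s| >= 2 so that at least one earlier item exists.
    """
    last = items[-1]        # embedding of the most recent item
    earlier = items[:-1]    # embeddings of items j = 1 .. |s|-1
    # Unnormalised weights: alpha_j = q^T sigmoid(W1 last + W2 i_j + c)
    hidden = 1.0 / (1.0 + np.exp(-(W1 @ last + earlier @ W2.T + c)))
    alpha = hidden @ q      # shape (|s|-1,)
    # Softmax normalisation, shifted by the max for numerical stability
    weights = np.exp(alpha - alpha.max())
    weights /= weights.sum()
    s_prime = weights @ earlier                     # intermediate embedding s'
    return W3 @ np.concatenate([s_prime, last])     # shape (d2,)

# Toy check with d2 = 2 (hypothetical values): with q = 0 every earlier item
# receives the same weight, so s' is the mean of the earlier items.
d2 = 2
items = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
W1 = W2 = np.eye(d2)
W3 = np.hstack([np.eye(d2), np.zeros((d2, d2))])    # keeps s', drops the last item
s_l = llm_session_embedding(items, W1, W2, W3, q=np.zeros(d2), c=np.zeros(d2))
# s_l is [0.5, 0.5]
```

In training, `W1`, `W2`, `W3`, `q` and `c` would be learned parameters while the rows of `items` stay frozen.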

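$$\hat{y}_{k}=\frac{\exp(\textbf{i}_{k}^{T}\textbf{s})}{\sum_{j=1}^{n}\exp(\textbf{i}_{j}^{T}\textbf{s})}.\qquad(1)$$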
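The fusion and scoring steps can be sketched as below: the two session views and the two item views are concatenated, projected to $d$ dimensions, and scored with a softmax over dot products. The function name `fuse_and_score` and the random toy inputs are assumptions for illustration; any normalisation or scaling used by a specific backbone is omitted.

```python
import numpy as np

def fuse_and_score(s_m, s_l, I_m, I_l, W4, W5):
    """Relevance distribution over n items from fused embeddings.

    s_m: (d1,)  s_l: (d2,)  I_m: (n, d1)  I_l: (n, d2)
    W4, W5: (d, d1 + d2)
    """
    s = W4 @ np.concatenate([s_m, s_l])               # fused session, (d,)
    I = np.concatenate([I_m, I_l], axis=1) @ W5.T     # fused items, (n, d)
    logits = I @ s                                    # i_j^T s for every item j
    exp_logits = np.exp(logits - logits.max())        # stable softmax
    return exp_logits / exp_logits.sum()

# Toy usage with random parameters (hypothetical sizes).
rng = np.random.default_rng(0)
n, d1, d2, d = 5, 3, 4, 3
probs = fuse_and_score(
    s_m=rng.normal(size=d1), s_l=rng.normal(size=d2),
    I_m=rng.normal(size=(n, d1)), I_l=rng.normal(size=(n, d2)),
    W4=rng.normal(size=(d, d1 + d2)), W5=rng.normal(size=(d, d1 + d2)),
)
top_2 = np.argsort(-probs)[:2]   # indices of the two highest-scoring items
```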

3.3.2. Training and Inferencing of SemSR

The goal is to obtain an $\mathbf{s}$ that is close to the embedding $\mathbf{i}_{s,|s|+1}$ of the target item $i_{k}=i_{s,|s|+1}$ , where $k$ is the estimated class for the target item, $k=\operatorname*{arg\,max}_{j}~\mathbf{i}_{j}^{T}\mathbf{s}$ with $j=1\ldots n$ . For this $n$ -way classification task, the softmax (cross-entropy) loss is used during training to estimate $\bm{\theta}$ by minimizing the sum of $\mathcal{L}(\hat{\mathbf{y}})=-\sum_{j=1}^{n}\mathbf{y}_{j}\log(\hat{\mathbf{y}}_{j})$ over all training samples, where $\mathbf{y}\in\{0,1\}^{n}$ is a 1-hot vector with $\mathbf{y}_{k}=1$ corresponding to the correct (target) class $k$ .

During inference, the final recommendation scores for the $n$ items are computed using Eq. 1. The $K$ items with the highest scores are the recommended items.
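For a single training sample, the loss and the top- $K$ selection reduce to a few lines. The sketch below uses plain Python and hypothetical logits; because the label is 1-hot, the cross-entropy sum collapses to the negative log-probability of the target item.

```python
import math

def softmax(logits):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Cross-entropy with a 1-hot label: -log of the target's probability."""
    return -math.log(softmax(logits)[target])

def top_k(logits, k):
    """Indices of the k highest-scoring items, best first."""
    return sorted(range(len(logits)), key=lambda j: logits[j], reverse=True)[:k]

# Hypothetical scores i_j^T s for a catalogue of three items.
scores = [2.0, 0.5, 1.0]
loss = cross_entropy(scores, target=0)
recommended = top_k(scores, k=2)            # [0, 2]
loss_uniform = cross_entropy([0.0] * 4, target=1)   # equals log(4)
```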

4. Experimental Evaluation

In this section, we conduct extensive experiments to answer the following research questions:

• RQ1: Are in-context LLMs’ predictions competitive with data-driven SR methods?

• RQ2: Do semantic-aware representations combined with data-driven models improve performance?

• RQ3: Does re-ranking top-K recommendations improve performance in terms of MRR?

We consider two SR models from the literature to demonstrate the efficacy of SemSR: MSGAT (Qiao et al., 2023) and NISER (Gupta et al., 2019). We compare SemSR with existing SR models from the literature, e.g., NARM (Li et al., 2017) and SRGNN (Belieni and Mesquita, 2025).

Dataset Details: We evaluate our approach on two datasets: the English dataset from the Amazon KDDCup challenge 2023 (AmazonKDD-M2 (UK); https://www.aicrowd.com/challenges/amazon-kdd-cup-23-multilingual-recommendation-challenge) and the Amazon Beauty 2014 review dataset (Beauty; https://cseweb.ucsd.edu/~jmcauley/datasets/amazon/links.html). The statistics of the datasets after preprocessing are shown in Table 1.

AmazonKDD-M2 (UK): We use a real-world multilingual recommendation dataset from the KDDCup challenge 2023 to evaluate the effectiveness of SemSR. The dataset provides items and sessions in six different languages, but for this paper non-English sessions are filtered out, yielding $499,611$ items with an average session length of $4.12$ . We consider data from task 1 of the Amazon KDDCup challenge 2023. The numbers of training and testing sessions considered are $1,172,181$ and $115,936$ , respectively. For the LLM as RS experiments, we use a subset of the data to reduce computational cost: the most recent $10k$ sessions, selected by a chronological split, form the test set.

Amazon-Beauty: We consider the Amazon-Beauty sub-category from the Amazon 2014 review dataset, which contains timestamped user-item interactions from May 1996 to July 2014; its metadata contains items’ titles, descriptions, categories, brands, prices, etc. We only consider users having more than 5 reviews, filter out less popular items that have a frequency less than 5, and remove sessions of length less than 2. We split sessions based on user ids into training, validation and test sets in the ratio $80\%$ : $10\%$ : $10\%$ . The average session length is $6.40$ . Further, we create incremental sessions to improve training of the SR model.
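The text does not spell out how incremental sessions are built. A common reading, assumed here, is prefix augmentation: every prefix of a session becomes a training input whose label is the item that follows it. The function name `incremental_sessions` is hypothetical.

```python
def incremental_sessions(session):
    """Expand one session into (prefix, next_item) training pairs.

    Assumption: "incremental sessions" means prefix augmentation, i.e. a
    session (i1, i2, i3) yields ([i1], i2) and ([i1, i2], i3).
    """
    return [(session[:j], session[j]) for j in range(1, len(session))]

pairs = incremental_sessions(["a", "b", "c"])
# pairs == [(["a"], "b"), (["a", "b"], "c")]
```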

SemSR and its variants: We propose the following variants of our approach:

• Semantic Initialization via LLM embeddings (SemSR-I): Item embeddings are initialized with LLM embeddings for MSGAT (Qiao et al., 2023) and NISER (Gupta et al., 2019), denoted SemMSGAT-I and SemNISER-I, respectively.

• Semantically initialized model fused with SR (SemSR-F): LLM-based item embeddings are concatenated with data-driven item embeddings, followed by a linear transformation. Similarly, LLM-based session embeddings are fused with data-driven session embeddings. SemMSGAT-F and SemNISER-F are the variants for MSGAT and NISER, respectively.

• SemSR-I+, SemSR-F+: The recommendation lists obtained from SemSR-I and SemSR-F are re-ranked using the NISER model.
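A minimal sketch of the re-ranking step, under the assumption that it reorders a fixed candidate list by scores from a second model; the paper states only that NISER is used for re-ranking, so the function name `rerank` and the scores below are hypothetical.

```python
def rerank(candidates, reranker_scores):
    """Reorder a candidate list by a second model's scores, best first.

    candidates: item ids retrieved by the first-stage model.
    reranker_scores: dict item id -> score from the re-ranking model.
    Items the re-ranker did not score sink to the end; ties keep their
    first-stage order because Python's sort is stable.
    """
    return sorted(
        candidates,
        key=lambda item: reranker_scores.get(item, float("-inf")),
        reverse=True,
    )

first_stage = ["a", "b", "c"]                    # hypothetical top-3 list
scores = {"a": 0.1, "b": 0.7, "c": 0.4}          # hypothetical re-ranker scores
reranked = rerank(first_stage, scores)           # ["b", "c", "a"]
```

Because the candidate set is unchanged, Recall@K stays the same while MRR@K can move, which is consistent with the identical recall values of each model and its "+" variant in Table 2.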

Evaluation Metrics: We use the standard offline evaluation metrics Recall $@K$ and Mean Reciprocal Rank (MRR $@K$ ). Recall $@K$ is the proportion of test instances that have the desired item among the top- $K$ items. MRR $@K$ is the average of the reciprocal rank of the desired item in the recommendation list.

Hyperparameter Setup: We use validation data for hyperparameter selection, with Recall $@100$ as the performance metric, for all approaches except the in-context LLM-based approach, which does not require model training. We use the Adam optimizer with mini-batch size $100$ , momentum $0.9$ , $d_{1}=100$ , $d_{2}=1024$ , and $d=100$ . For NISER and its SemSR variants, we use the same parameters as (Gupta et al., 2019), i.e., scaling factor $=16.0$ and learning rate $0.001$ . For MSGAT, we grid-search over learning rates in $\{0.1,0.01,0.001,0.0001\}$ . The best learning rate on the validation set is $lr=0.001$ .
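Both metrics can be computed directly from ranked lists. The sketch below assumes the common convention that a target outside the top- $K$ contributes a reciprocal rank of 0; the function names and the toy data are illustrative.

```python
def recall_at_k(ranked_lists, targets, k):
    """Fraction of test instances whose target appears in the top-k list."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:k])
    return hits / len(targets)

def mrr_at_k(ranked_lists, targets, k):
    """Mean reciprocal rank of the target, counting 0 when it is outside the top-k."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top = ranked[:k]
        if t in top:
            total += 1.0 / (top.index(t) + 1)
    return total / len(targets)

# Three hypothetical test sessions with their ranked recommendations.
ranked_lists = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
targets = ["a", "a", "x"]
recall = recall_at_k(ranked_lists, targets, k=2)   # 2 of 3 targets are found
mrr = mrr_at_k(ranked_lists, targets, k=2)         # (1 + 1/2 + 0) / 3 = 0.5
```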

4.1. Results and Discussion

From Table 2, we observe that MSGAT and NISER perform significantly better than the existing baselines NARM and SRGNN.

• RQ1: LLM as RS vs. data-driven SR models. In Table 3, we compare the LLM as RS methods FS-LLM, ZCoT-LLM, and FSCoT-LLM with traditional data-driven SR models. We observe that while LLM as RS methods demonstrate some capability in the recommendation task, their performance significantly lags behind data-driven models such as MSGAT and NISER. For example, FSCoT-LLM achieves R@20 and MRR@20 scores of 31.70 and 10.00, respectively, on the Amazon-M2 dataset, and only 7.04 and 1.86, respectively, on the Amazon-Beauty dataset. These results suggest that although LLMs may offer generalization benefits, they currently lack the collaborative knowledge that is necessary for session-based recommendation.

• RQ2: Effectiveness of semantic integration. From Table 2, we observe that semantic initialization of embeddings significantly enhances performance at a coarse level of retrieval, i.e., SemMSGAT-I and SemNISER-I perform significantly better than the respective vanilla SR models in terms of recall on the Amazon-M2 (UK) dataset and are comparable on the Amazon-Beauty dataset. From Table 2, we also observe that the fusion-based models SemMSGAT-F and SemNISER-F consistently outperform their vanilla SR models as well as other existing methods in terms of both recall and MRR, which further emphasizes the advantages of incorporating semantic information into data-driven SR models.

• RQ3: Assessing the benefits of re-ranking. From Table 2, we observe that re-ranking using vanilla SR methods helps to improve fine-grained ranking, i.e., it further improves MRR for all semantic variants SemMSGAT-I+, SemMSGAT-F+, SemNISER-I+ and SemNISER-F+ over SemMSGAT-I, SemMSGAT-F, SemNISER-I and SemNISER-F, respectively.

• From Figure 3, we observe that for lower values of $K$ (50 and 100), the vanilla SR models MSGAT and NISER perform better than MSGAT-I and NISER-I, respectively. However, for higher values of $K$ (K=200, …, 500) the trend is reversed. Moreover, the performance gap between MSGAT vs MSGAT-I and NISER vs NISER-I widens as K increases. This suggests that semantic initialization via LLM embeddings excels at coarse-level retrieval.

5. Conclusion

In this work, we highlighted various methods of leveraging LLMs for SR: in-context prompting and retrieval via LLMs, and integrating LLM-based embeddings with state-of-the-art deep learning SR models, namely MSGAT (Qiao et al., 2023) and NISER (Gupta et al., 2019). We compared the different approaches to incorporating LLMs and demonstrated that LLMs can help to improve the performance of SR models by capturing the semantics of items and their features and meta-information.

Bibliography40

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1(1)
2Behnam Ghader et al. (2024) Parishad Behnam Ghader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, and Siva Reddy. 2024. Llm 2vec: Large language models are secretly powerful text encoders. ar Xiv preprint ar Xiv:2404.05961 (2024).
3Belieni and Mesquita (2025) Juan Belieni and Diego Mesquita. 2025. SRGNN: simple recurrent graph neural network. Proceeding Series of the Brazilian Society of Computational and Applied Mathematics 11, 1 (2025), 1–2.
4Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems 33 (2020), 1877–1901.
5Dai et al. (2023) Sunhao Dai, Ninglu Shao, Haiyuan Zhao, Weijie Yu, Zihua Si, Chen Xu, Zhongxiang Sun, Xiao Zhang, and Jun Xu. 2023. Uncovering chatgpt’s capabilities in recommender systems. In Proceedings of the 17th ACM Conference on Recommender Systems . 1126–1132.
6Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers) . 4171–4186.
7Ekstrand et al. (2011) Michael D Ekstrand, John T Riedl, Joseph A Konstan, et al. 2011. Collaborative filtering recommender systems. Foundations and Trends® in Human–Computer Interaction 4, 2 (2011), 81–173.
8Geng et al. (2022) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2022. Recommendation as language processing (rlp): A unified pretrain, personalized prompt & predict paradigm (p 5). In Proceedings of the 16th ACM conference on recommender systems . 299–315.