Evaluating Recabilities of Foundation Models: A Multi-Domain, Multi-Dataset Benchmark

Qijiong Liu; Jieming Zhu; Yingxin Lai; Xiaoyu Dong; Lu Fan; Zhipeng Bian; Zhenhua Dong; Xiao-Ming Wu

arXiv:2508.21354 · cs.IR · September 1, 2025

TL;DR

This paper introduces RecBench-MD, a comprehensive benchmark for evaluating the recommendation capabilities of foundation models across multiple datasets and domains, highlighting the importance of fine-tuning and multi-domain training.

Contribution

The study presents RecBench-MD, a new benchmark for assessing foundation models' recommendation abilities across diverse datasets and domains, with extensive evaluations of 19 models.

Findings

01. In-domain fine-tuning yields the best performance.

02. Cross-dataset transfer learning supports new recommendation scenarios.

03. Multi-domain training improves model adaptability.

Abstract

Comprehensive evaluation of the recommendation capabilities of existing foundation models across diverse datasets and domains is essential for advancing the development of recommendation foundation models. In this study, we introduce RecBench-MD, a novel and comprehensive benchmark designed to assess the recommendation abilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective. Through extensive evaluations of 19 foundation models across 15 datasets spanning 10 diverse domains -- including e-commerce, entertainment, and social media -- we identify key characteristics of these models in recommendation tasks. Our findings suggest that in-domain fine-tuning achieves optimal performance, while cross-dataset transfer learning provides effective practical support for new recommendation scenarios. Additionally, we observe that multi-domain training…

Tables (9)

Table 1: Comparison of our RecBench-MD with existing benchmarks. “–” indicates that RecBole-CDR theoretically supports the corresponding feature, although no experimental results are provided.

Benchmark		Zhang et al.	OpenP5	LLMRec	PromptRec	Jiang et al.	RSBench	RecBole-CDR	RecBench	RecBench-MD
Year		2021	2024	2023	2024b	2024	2024a	2022	2025b	(ours)
Scale	#Foundation Models	4	2	7	4	7	1	0	17	19
Scale	#Dataset	1	10	1	3	4	3	3	5	15
Setting	Zero-shot	✓	×	✓	✓	×	×	×	✓	✓
Setting	Single-Dataset	✓	✓	✓	×	✓	✓	–	✓	✓
Setting	In-domain Cross-dataset	×	×	×	×	×	×	–	×	✓
Setting	In-domain Multi-dataset	×	×	×	×	×	×	–	×	✓
Setting	Cross-domain	×	×	×	✓	×	×	–	×	✓
Setting	Multi-domain	×	✓	×	×	×	×	–	×	✓
Approach	Prompt-based	✓	✓	✓	✓	✓	✓	×	✓	✓
Approach	Embedding-based	×	×	×	×	×	×	–	×	✓
Metric	Quality	✓	✓	✓	✓	✓	✓	–	✓	✓
Metric	Efficiency	×	×	×	×	×	×	×	✓	✓

Table 2: Datasets evaluated or fine-tuned in our benchmark.

Table 3: Performance comparison in the single-domain fine-tuning scenario. Cell background color indicates the different regimes. Only the AUC metric is reported due to page limits.

Table 4: Performance comparison in the cross-dataset fine-tuning scenario. Cell background color indicates the different regimes. We mark the top-5 ranked fine-tuning sets for each test set and bold the best. Red indicates a result that is inferior to the zero-shot one or smaller than 0.5. Only the AUC metric is reported due to page limits.

Table 5: Performance comparison in the multi-domain fine-tuning scenario. Cell background color indicates the different settings. Only the AUC metric is reported due to page limits.

Table 6: Performance comparison across various fine-tuning datasets and orders. Each sub-table presents 25 AUC scores on a test set. For the entry at row $i$, column $j$ (e.g., $i=1$, $j=2$ in the MIND sub-table), the model is first fine-tuned on POG, then on PENS, yielding an AUC of 0.6687 when tested on MIND. “Overall” indicates the average performance across the corresponding five test sets. The diagonal cells (highlighted in grey) represent results of single-dataset fine-tuning. We rank these five values and annotate the rank next to each dataset name in the column header (e.g., PENS (1) in the MIND sub-table indicates that 0.6823 is the highest result in single-dataset fine-tuning). For each pair of datasets, different fine-tuning orders generally yield significantly different results. The superior result for each pair is highlighted in green. The top five results among the 25 entries within each sub-table are annotated with their respective rankings.

Table 7: Performance comparison in the multi-domain recommendation scenario, with the evaluation metrics. Experiments are conducted on the MIND and Micro. datasets. We bold the best results for each metric.

Table 8: Performance comparison in the multi-domain recommendation scenario, with the evaluation metrics. Experiments are conducted on the Micro. and Yelp datasets. We bold the best results for each metric.

Table 9: Performance comparison in the multi-domain recommendation scenario, with the evaluation metrics. Experiments are conducted on the Good. and CDs datasets. We bold the best results for each metric.

Equations

$$\mathcal{L} = -\sum \left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right],$$

$$\hat{y} = \frac{e^{l_{\text{yes}}}}{e^{l_{\text{yes}}} + e^{l_{\text{no}}}}.$$

$$\hat{y} = \frac{\mathbf{u} \cdot \mathbf{t}}{\|\mathbf{u}\| \, \|\mathbf{t}\|},$$

$$\text{RRA@}K = \frac{1}{T} \sum_{i=1}^{T} \left( \mathbb{I}(r_{i} \le K) \cdot \frac{1}{r_{i}} \right),$$

where the indicator function $\mathbb{I}(b) = 1$ if $b$ is true and $0$ otherwise.

Taxonomy

Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Advanced Graph Neural Networks

Full text

Evaluating Recabilities of Foundation Models: A Multi-Domain, Multi-Dataset Benchmark

Qijiong Liu¹, Jieming Zhu², Yingxin Lai³, Xiaoyu Dong¹, Lu Fan¹, Zhipeng Bian⁴, Zhenhua Dong², Xiao-Ming Wu¹

¹The Hong Kong Polytechnic University ²Huawei Noah’s Ark Lab, Shenzhen, China ³Xiamen University, Xiamen, China ⁴Shenzhen University, Shenzhen, China

Abstract

Comprehensive evaluation of the recommendation capabilities of existing foundation models across diverse datasets and domains is essential for advancing the development of recommendation foundation models. In this study, we introduce RecBench-MD, a novel and comprehensive benchmark designed to assess the recommendation abilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective. Through extensive evaluations of 19 foundation models across 15 datasets spanning 10 diverse domains, including e-commerce, entertainment, and social media, we identify key characteristics of these models in recommendation tasks. Our findings suggest that in-domain fine-tuning achieves optimal performance, while cross-dataset transfer learning provides effective practical support for new recommendation scenarios. Additionally, we observe that multi-domain training significantly enhances the adaptability of foundation models. All code (https://github.com/Jyonn/RecBench-MD) and data (https://www.kaggle.com/datasets/qijiong/recbench-md) have been publicly released to facilitate future research.

1 Introduction

The rapid emergence of foundation models, particularly large language models (LLMs), has revolutionized various fields such as natural language processing (NLP) (Touvron et al., 2023a; Reid et al., 2024) and computer vision (Kirillov et al., 2023; Li et al., 2024). Recently, their application in recommender systems has attracted considerable interest, as these models promise a unified framework capable of modeling user–item interactions through natural language (Wu et al., 2024a; Zhao et al., 2024; Bao et al., 2023b). Despite the existence of numerous foundation models, most are primarily designed for NLP tasks, and there is currently a lack of effective strategies for selecting appropriate models to develop recommendation foundation models. Consequently, assessing the recommendation abilities, referred to as Recabilities, of foundation models has become increasingly important.

Recommendation foundation models, akin to LLMs with general NLP capabilities, should exhibit broad zero-resource Recabilities, allowing for inference on unseen datasets or even novel domains. (In this paper, zero-resource means fine-tuning on some datasets and testing on unseen ones, i.e., cross-dataset, while zero-shot means testing without any fine-tuning.) This necessitates a comprehensive evaluation of the recommendation abilities of existing foundation models across various datasets, domains, training and evaluation strategies, and recommendation tasks and approaches. Although efforts such as LLMRec (Liu et al., 2023) and PromptRec (Wu et al., 2024b) exist, these studies primarily concentrate on a single domain or dataset using one recommendation approach, leading to a constrained evaluation scope and partial conclusions.

To address these challenges, we introduce a multi-domain recommendation taxonomy that examines all applicable scenarios across eight settings, as depicted in Figure 1. Initially, recommendation models were developed for individual datasets, corresponding to the single-domain single-dataset setting. Subsequently, researchers explored cross-domain recommendation models, represented by the cross-domain cross-dataset setting, which aim to transfer user interest knowledge from a source domain to a target domain. Additionally, some studies have investigated the integration of multiple domains to train a unified model, corresponding to the multi-domain multi-dataset setting, transitioning from a one-dataset-one-model paradigm to a multiple-dataset-one-model paradigm. More recently, several studies have assessed the zero-shot Recabilities of foundation models. However, these evaluations are often limited to a single dataset (the zero-shot single-dataset setting), which can lead to unreliable and biased results. A more robust approach is to evaluate across multiple datasets and compute the average, i.e., the zero-shot multi-domain setting.

In this work, we present a comprehensive benchmark, RecBench-MD, specifically designed to evaluate the Recabilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective, encompassing all settings illustrated in Figure 1. This study is pioneering in its benchmarking of cross-dataset recommendation for zero-resource settings. We examine a range of recommendation approaches, including prompt-based ranking tasks and embedding-based matching tasks, thereby covering the main recommendation scenarios. Our evaluation is unprecedented in scope, encompassing 15 recommendation datasets across 10 domains and 19 foundation models. Furthermore, we provide open-source code and datasets, so that future large-scale recommendation or foundation models can be evaluated with simple configuration. Our experiments required 1,000 GPU hours, and the platform’s reusability significantly reduces experimental costs for future researchers, allowing them to concentrate more on model optimization and algorithm innovation.

Our benchmarking results reveal several key insights. First, larger models tend to benefit more from joint training on multiple datasets or domains, exhibiting stronger cross-domain generalization. Second, the degree of transferability across domains varies considerably, with a strong dependence on the characteristics of the source dataset. Third, while in-domain datasets exhibit higher relevance, this is not universally observed across all scenarios. Fourth, cross-dataset transfer can serve as an effective model warm-up strategy in novel recommendation contexts, though it is challenging to exceed the performance upper bound established by fine-tuning on single or multiple datasets within the target domain.

2 Related Work

Existing Benchmarks. Several benchmarks have been proposed to evaluate the Recabilities of foundation models, including LLMRec (Liu et al., 2023), PromptRec (Wu et al., 2024b), and others (Zhang et al., 2021; Jiang et al., 2024; Liu et al., 2024a). However, as illustrated in Table 1, (i) these benchmarks provide only a limited evaluation of recommendation settings, often focusing on a single approach, and (ii) the number of foundation models and datasets evaluated remains relatively small, resulting in an incomplete and fragmented performance landscape in this domain.

Multi-domain Recommendation. Traditional multi-domain recommendation methods predominantly rely on item-based or user-based knowledge transfer, using common items or shared user interactions to mitigate data sparsity and domain discrepancies (Guo et al., 2021; Chen et al., 2022, 2019). However, such approaches require explicit entity-level overlap between domains, a condition rarely met in real-world scenarios (Zhu et al., 2021; Zang et al., 2022). In contrast, text-based knowledge transfer leverages rich entity-side information, such as item descriptions and user profiles, used in diverse features (Chen et al., 2013; Gao et al., 2013). By drawing on the semantic comprehension and generation capabilities of foundation models, text-based methods boost cross-domain learning without the need for explicit entity alignment, i.e., with no overlap of users or items, thereby offering a more flexible and robust framework for transferring knowledge across heterogeneous domains.

Foundation Models for Recommendation. In recent years, integrating large language models (LLMs) into recommender systems has attracted significant academic and industrial interest. These integrations can be broadly classified into two paradigms (Wu et al., 2024a; Zhao et al., 2024; Bao et al., 2023b; Chen et al., 2024): LLM-for-RS and LLM-as-RS. The LLM-for-RS paradigm enhances traditional recommenders via feature engineering or encoding techniques using LLMs (Wei et al., 2024; Liu et al., 2024b, c, 2025a; Wu et al., 2023; Zhou et al., 2025; Hu et al., 2024). In contrast, the LLM-as-RS paradigm employs LLMs directly as recommenders (Ngo and Nguyen, 2024; Li et al., 2023; Geng et al., 2022; Liu et al., 2024d). Studies have demonstrated its superior accuracy in contexts such as cold-start scenarios (Bao et al., 2023a) and in tasks requiring natural language understanding and generation (Luo et al., 2023; Wang et al., 2023; He et al., 2023).

3 Proposed Benchmark: RecBench-MD

3.1 Recommendation Settings

In bottom-level text-based knowledge transfer, we can freely collect training data as long as: (i) each item is described by textual content, and (ii) each user is represented by their item consumption sequence. To systematically explore how cross-domain data influences target-domain recommendation performance, we propose a novel taxonomy comprising eight fine-tuning settings, as illustrated in Figure 1:

(Zero-resource) Zero-shot Single-dataset. The model is directly evaluated on a single dataset without any fine-tuning. This setting measures the model’s intrinsic Recability.

(Zero-resource) Zero-shot Multi-domain. A more comprehensive zero-shot evaluation: the model is tested on multiple datasets from different domains, and performance is averaged to assess generalization.

Single-domain Single-dataset. Fine-tuning and evaluation are performed on the same dataset. This setting reflects standard in-domain supervised learning.

(Zero-resource) Single-domain Cross-dataset. The model is fine-tuned on one or more datasets within a domain and evaluated on a different dataset from the same domain. It assesses domain-level generalization across datasets.

Single-domain Multi-dataset. Training and testing data are drawn from multiple datasets within the same domain, with potential overlap. This setting measures the benefit of aggregating in-domain data.

(Zero-resource) Cross-domain Cross-dataset. The model is fine-tuned on one domain and tested on a completely different one. This setting probes cross-domain transferability of recommendation knowledge.

(Zero-resource) Multi-domain Cross-dataset. Training and testing datasets come from overlapping but non-identical domains. This setting evaluates how auxiliary domain knowledge contributes to target performance when datasets do not overlap.

Multi-domain Multi-dataset. Both domains and datasets overlap between training and testing. This setting examines the upper-bound performance achievable via comprehensive domain and dataset fusion.

3.2 Recommendation Approaches

We evaluate the Recabilities of foundation models with the pair-wise user–item click prediction task, which estimates the probability $\hat{y}$ that a user will interact positively with a candidate item. The models are therefore trained with the binary cross-entropy (BCE) loss, formulated as:

$$\mathcal{L} = -\sum \left[ y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \right], \tag{1}$$

where $y\in\{0,1\}$ denotes the ground-truth label. Borrowing the idea from conventional recommendation, including matching-based and ranking-based models, we devise two recommendation approaches to calculate the click probabilities.
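As a minimal sketch of the BCE objective above, the following plain-Python function sums the loss over a batch of (label, predicted probability) pairs. The function name `bce_loss` and the clamping constant `eps` are illustrative assumptions and are not taken from the released code.

```python
import math


def bce_loss(labels, probs, eps=1e-12):
    """Summed binary cross-entropy over (label, predicted probability) pairs.

    `eps` is an assumed clamping constant that keeps log() away from zero.
    """
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp the prediction into (0, 1)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total
```

A confident correct prediction contributes a small term and a confident wrong one a large term, which is what drives fine-tuning toward calibrated click probabilities.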

Prompt-based Recommendation. We concatenate the user sequence with the candidate item, where each item in the sequence and the candidate item are represented by their textual features. The entire user–item sequence is then combined with a task-specific instruction (e.g., "Will the user be interested in this item? Answer (Yes or No):"). Next, the model is guided to predict specific output tokens (i.e., “Yes” or “No”), from which we take the corresponding logits, $l_{\text{yes}}$ and $l_{\text{no}}$. Finally, the click probability $\hat{y}$ is computed as:

$$\hat{y} = \frac{e^{l_{\text{yes}}}}{e^{l_{\text{yes}}} + e^{l_{\text{no}}}}. \tag{2}$$
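The prompt-based scoring described above can be sketched as follows. Only the instruction string is quoted from the text; the prompt layout in `build_prompt` and both function names are hypothetical, and the two logits are assumed to have been read from the model's output at the answer position.

```python
import math

# Instruction quoted from the text; everything else in the template is assumed.
INSTRUCTION = "Will the user be interested in this item? Answer (Yes or No):"


def build_prompt(history_titles, candidate_title):
    """Join the user's item sequence, the candidate item, and the instruction."""
    history = "\n".join(
        f"({i}) {title}" for i, title in enumerate(history_titles, start=1)
    )
    return (
        f"User history:\n{history}\n"
        f"Candidate item: {candidate_title}\n"
        f"{INSTRUCTION}"
    )


def click_probability(logit_yes, logit_no):
    """Two-way softmax over the 'Yes' and 'No' logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Restricting the softmax to the two answer tokens ignores the probability mass the model places on the rest of the vocabulary, so the score is always a valid probability even when neither token is the most likely one.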

Embedding-based Recommendation. Following the matching-based two-tower paradigm, the foundation models are employed as user and item encoders that learn dense representations (embeddings) within a shared latent space. Specifically, we use the output embedding of the last token as the user or item representation when the input is the user sequence or the candidate item, respectively. The click probability is then measured by the cosine similarity:

$$\hat{y} = \frac{\mathbf{u} \cdot \mathbf{t}}{\|\mathbf{u}\| \, \|\mathbf{t}\|}, \tag{3}$$

where $\cdot$ denotes the dot product operation, $\|\cdot\|$ represents the L2 norm, and $\mathbf{u}$ and $\mathbf{t}$ are user and item representations.
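A minimal sketch of the cosine-similarity score, assuming the user and item embeddings are already available as plain lists of floats (in practice they would be last-token hidden states from the encoder). The function name is illustrative. Note that cosine similarity lies in [-1, 1]; the text does not say how negative values are handled inside the BCE loss, so this sketch returns the raw similarity.

```python
import math


def cosine_similarity(u, t):
    """Cosine similarity between a user vector u and an item vector t."""
    if len(u) != len(t):
        raise ValueError("u and t must have the same dimension")
    dot = sum(a * b for a, b in zip(u, t))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_t = math.sqrt(sum(b * b for b in t))
    if norm_u == 0.0 or norm_t == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_t)
```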

4 Experimental Setup

Datasets. To meaningfully probe foundation model capabilities in recommendation beyond prevalent single-dataset evaluations, a deliberately heterogeneous suite of 15 public datasets across 10 domains was assembled. This collection’s scale and diversity (Table 2) are necessary to stress-test the central premise of foundation model generalization across varied recommendation contexts, spanning high-volume consumer arenas (e.g., fashion, news) to specialized niches (e.g., games, hotels). The inherent heterogeneity in item taxonomies, user interaction dynamics, textual signal richness, and sparsity levels is instrumental: it lets us go beyond potentially idiosyncratic single-domain observations and evaluate genuine cross-domain generalization. For each dataset, the fine-tuning set is randomly split into a training set and a validation set in a 9:1 ratio.
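The 9:1 random split can be sketched as below. The function name, the fixed `seed`, and the rounding rule for the validation size are assumptions made for illustration; the text only specifies the ratio.

```python
import random


def split_train_valid(samples, valid_ratio=0.1, seed=0):
    """Randomly split fine-tuning samples into (train, valid) at roughly 9:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG keeps the split reproducible
    n_valid = int(round(len(shuffled) * valid_ratio))
    return shuffled[n_valid:], shuffled[:n_valid]
```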

Foundation Models. We collected 19 foundation models from different perspectives to evaluate their Recabilities, including: BERT ${}_{\text{base}}$ (Kenton and Toutanova, 2019), OPT ${}_{\text{350M}}$ (Zhang et al., 2022), OPT ${}_{\text{1B}}$ (Zhang et al., 2022), Llama-1 ${}_{\text{7B}}$ (Touvron et al., 2023a), Llama-2 ${}_{\text{7B}}$ (Touvron et al., 2023b), Llama-3 ${}_{\text{8B}}$ (Dubey et al., 2024), Llama-3.1 ${}_{\text{8B}}$ (Meta AI, 2024), GPT-3.5 (OpenAI, 2023), Qwen-2 ${}_{\text{500M}}$ (Yang et al., 2024), Qwen-2 ${}_{\text{1.5B}}$ (Yang et al., 2024), Qwen-2 ${}_{\text{7B}}$ (Yang et al., 2024), GLM-4 ${}_{\text{9B}}$ (GLM et al., 2024), Mistral-2 ${}_{\text{7B}}$ (Jiang et al., 2023), DS-Qwen-2 ${}_{\text{7B}}$ (Bi et al., 2024), E5 ${}_{\text{base-v2}}$ (Wang et al., 2022), Phi-2 ${}_{\text{3B}}$ (Javaheripi et al., 2023), RecGPT ${}_{\text{7B}}$ (Ngo and Nguyen, 2024), P5 ${}_{\text{beauty}}$ (Geng et al., 2022), and Recformer (Li et al., 2023). We present a comprehensive comparison across multiple dimensions: varying model sizes within the same organization (e.g., Qwen-2 series), different versions from the same organization (e.g., Llama series), models of similar size released in the same year by different organizations (e.g., THU’s GLM-4 ${}_{\text{9B}}$, Meta’s Llama-3 ${}_{\text{8B}}$, and Alibaba’s Qwen-2 ${}_{\text{7B}}$ in 2024), and models targeting different domains (e.g., the general foundation model Llama vs. the recommendation foundation model RecGPT). Specifically, the closed-source GPT-3.5 model from OpenAI supports only the prompt-based recommendation paradigm due to the unavailability of item and user embeddings. In contrast, models like Recformer and E5 ${}_{\text{base-v2}}$, designed with a dual-tower architecture, can only be evaluated with the embedding-based paradigm.

Evaluation Protocols. Following common practice (Liu et al., 2025b), we evaluate recommendation performance using widely adopted metrics, including ranking metrics such as GAUC, nDCG, and MRR, as well as matching metrics like F1 and Recall. However, due to space limitations, we mostly present only the GAUC metric (AUC for short). The full evaluation results will be available on our webpage.
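The text does not spell out how GAUC is computed. The sketch below follows one common definition, stated here as an assumption: AUC is computed separately for each user, users whose samples are all positive or all negative are skipped, and the per-user AUCs are averaged with weights equal to each user's sample count. Function names are illustrative.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(user_ids, labels, scores):
    """Sample-count-weighted average of per-user AUC (assumed weighting)."""
    groups = defaultdict(lambda: ([], []))
    for user, y, s in zip(user_ids, labels, scores):
        groups[user][0].append(y)
        groups[user][1].append(s)
    weighted_sum, total_weight = 0.0, 0
    for user_labels, user_scores in groups.values():
        user_auc = auc(user_labels, user_scores)
        if user_auc is None:
            continue  # skip users with a single class
        weighted_sum += user_auc * len(user_labels)
        total_weight += len(user_labels)
    return weighted_sum / total_weight if total_weight else None
```

Grouping by user removes the effect of score offsets between users, so the metric reflects how well candidates are ordered within each user's own list.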

We also design the Reciprocal Rank Average (RRA) metric to evaluate the contribution of each fine-tuning set (used in Table 4). Specifically, we mark the top-$K$ fine-tuning sets for each test set and calculate the top-$K$ RRA metric by:

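$$\text{RRA@}K = \frac{1}{T} \sum_{i=1}^{T} \left( \mathbb{I}(r_{i} \le K) \cdot \frac{1}{r_{i}} \right), \tag{4}$$

where $T$ is the number of the test datasets, $r_{i}$ is the rank of the model on the $i$-th dataset, $K$ is the rank threshold (e.g., $K=5$), and the indicator function $\mathbb{I}(b)$ equals 1 if $b$ is true and 0 otherwise.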
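The RRA@K definition can be checked with a few lines of Python. The function name and the example ranks are illustrative; `ranks` is assumed to hold one 1-based rank per test dataset.

```python
def rra_at_k(ranks, k=5):
    """Reciprocal Rank Average: mean of 1/r over test sets, counting only r <= k."""
    if not ranks:
        raise ValueError("ranks must contain one rank per test dataset")
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)
```

For example, a fine-tuning set ranked 1st, 2nd, and 6th on three test sets scores (1 + 1/2 + 0) / 3 = 0.5 at K = 5, since the 6th place falls outside the threshold and contributes nothing.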

Implementation Details. During data preprocessing, we standardized datasets of varying original sizes to comparable scales: the test set contains approximately 20,000 samples, while the fine-tuning set consists of around 100,000 samples. For each dataset, items were carefully curated to retain the most representative textual content features. User behavior sequences were truncated to a maximum length of 20; if a sequence exceeded this limit, only the most recent interactions were preserved.
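The sequence truncation rule can be sketched in one function, assuming each history is stored oldest-first. The function name is illustrative.

```python
def truncate_history(item_ids, max_len=20):
    """Keep only the most recent max_len interactions (history is oldest-first)."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return list(item_ids[-max_len:])
```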

We fine-tune models using LoRA (Low-Rank Adaptation; Hu et al., 2022), a parameter-efficient strategy, with rank 32 and alpha 128. The learning rate is set to 1e-4 across all experiments, using the Adam optimizer. An effective batch size of 32 is maintained via gradient accumulation, and early stopping is applied with a patience of 2. Models are built and evaluated using the Huggingface Transformers library (Wolf et al., 2019). For BERT ${}_{\text{base}}$, OPT ${}_{\text{1B}}$, and Llama-3 ${}_{\text{8B}}$, the maximum sequence lengths are 512, 1024, and 1024, respectively, with precision set to float32 for BERT and bfloat16 for OPT ${}_{\text{1B}}$ and Llama-3 ${}_{\text{8B}}$. To reduce fine-tuning overhead for embedding-based architectures, we freeze lower layers of OPT ${}_{\text{1B}}$ and Llama-3 ${}_{\text{8B}}$, applying LoRA only to the top two layers. When fine-tuning on multiple datasets, early stopping is based on the average validation AUC across datasets. We will release the code, data, checkpoints, and documentation at our GitHub repository.
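The training controls described above can be sketched without any deep-learning dependency. The hyperparameter values in `FINETUNE_SETTINGS` are taken from the text; the class and function names, the micro-batch size of 4, and the reading of "patience of 2" as two consecutive evaluations without improvement are assumptions.

```python
# Values stated in the text; the dictionary itself is only an illustration.
FINETUNE_SETTINGS = {
    "lora_rank": 32,
    "lora_alpha": 128,
    "learning_rate": 1e-4,
    "effective_batch_size": 32,
    "patience": 2,
}


def accumulation_steps(effective_batch_size=32, micro_batch_size=4):
    """Gradient-accumulation steps needed to reach the effective batch size."""
    if effective_batch_size % micro_batch_size != 0:
        raise ValueError("micro batch size must divide the effective batch size")
    return effective_batch_size // micro_batch_size


class EarlyStopper:
    """Stop after `patience` consecutive evaluations without improvement."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_evals = 0

    def step(self, per_dataset_auc):
        """Feed validation AUCs (one per dataset); return True to stop training."""
        average = sum(per_dataset_auc) / len(per_dataset_auc)
        if average > self.best:
            self.best = average
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience
```

Averaging the validation AUC across datasets before comparing means that a gain on one dataset can offset a drop on another, which matches the multi-dataset stopping rule stated in the text.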

All the experiments are conducted on a single Nvidia A100 GPU device. Except for the zero-shot setting, all results are averaged over five runs, with statistically significant differences observed ( $p<0.05$ ).

5 Findings and Discussions

In this section, we present a comprehensive analysis of experimental results evaluating the foundation model Recabilities in diverse fine-tuning regimes and different evaluation tasks. Due to space limits, more experimental results are provided in the appendix and supplementary material.

5.1 Zero-shot Multi-domain: Prompt-based vs. Embedding-based

Here, we investigate the zero-shot Recabilities of various foundation models. For each dataset, we identify the maximum and minimum AUC values across all evaluated models in both paradigms (with the minimum constrained to 0.5) and normalize the results accordingly, as shown in Figure 2. Based on these findings, we make the following observations:
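The per-dataset normalization can be sketched as a min-max rescaling. The text says only that the minimum is constrained to 0.5; this sketch reads that as raising the lower bound to at least 0.5 and mapping any AUC below the bound to 0, which is an assumption. The function name and the model labels in the usage are illustrative.

```python
def normalize_auc(auc_by_model, floor=0.5):
    """Min-max normalize one dataset's AUCs, with the minimum no lower than floor."""
    low = max(min(auc_by_model.values()), floor)
    high = max(auc_by_model.values())
    if high <= low:
        # No model beats the lower bound, so nothing can be distinguished.
        return {name: 0.0 for name in auc_by_model}
    return {
        name: max(value - low, 0.0) / (high - low)
        for name, value in auc_by_model.items()
    }
```

With a floor of 0.5, a model at chance level or below is drawn at zero, so the plotted value reflects only how far a model rises above random guessing relative to the best model on that dataset.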

First, for almost all datasets, the prompt-based evaluation paradigm outperforms the embedding-based one, as it aligns more closely with the pre-training objectives of foundation models.

Second, under the prompt-based paradigm, three LLMs (Mistral-2 ${}_{\text{7B}}$, GLM-4 ${}_{\text{9B}}$, Qwen-2 ${}_{\text{7B}}$) exhibit superior performance, possibly due to the inclusion of collaborative signals during pre-training. In contrast, P5 ${}_{\text{beauty}}$ performs well on two Amazon datasets (CDs and Electronics) but less favorably on others, likely because the checkpoint used was trained on the Amazon Beauty dataset and therefore models Amazon user interests.

Third, in the embedding-based paradigm, performance differences among models are less pronounced. Notably, smaller models (such as BERT ${}_{\text{base}}$ and OPT ${}_{\text{1B}}$) perform better under this setting than in the prompt-based paradigm, whereas the embeddings of larger models appear less sensitive to similarity metrics, in line with the findings of Freestone and Santu (2024). Additionally, the matching-based language model E5 ${}_{\text{base-v2}}$ and the recommendation model Recformer also demonstrate strong performance, benefiting from the consistency between the evaluation and training paradigms.

5.2 Single-domain Fine-tuning: vs.

Here, we study the single-domain fine-tuning recommendation scenario. We mainly select three foundation models of distinct sizes for the evaluation, i.e., BERT ${}_{\text{base}}$, OPT ${}_{\text{1B}}$, and Llama-3 ${}_{\text{8B}}$. From the results displayed in Table 3, we can make the following observations:

First, compared to zero-shot baselines, domain-specific fine-tuning strategies consistently achieve superior performance in both the prompt-based and embedding-based paradigms. This is primarily because large models have acquired domain-specific collaborative knowledge through fine-tuning.

Second, fine-tuning with a single-domain single-dataset setting yields more stable performance than the cross-dataset variant, even within the same domain. This is likely due to optimization conflicts between datasets, as observed on Goodreads and H&M, where the latter underperforms the former.

Third, large-scale foundation models (e.g., Llama-3 ${}_{\text{8B}}$ ) achieve the best performance under the , as their pretraining enables a broad understanding of general textual knowledge across domains, allowing them to effectively extract and generalize useful patterns from auxiliary datasets to the target dataset. In contrast, smaller models such as BERT ${}_{\text{base}}$ are less suited for , as they struggle to abstract transferable patterns even from datasets within the same domain, leading to limited performance gains.

5.3 Cross-dataset Fine-tuning: vs.

Here, we study the effect of cross-dataset fine-tuning, including single-domain and cross-domain scenarios. The experiments are conducted across two foundation models: BERT ${}_{\text{base}}$ and Llama-3 ${}_{\text{8B}}$. The foundation model is first fine-tuned on a single dataset and then evaluated on 10 test datasets. We design an RRA metric (Equation 4) to evaluate the usefulness of each fine-tuning set.

Based on Table 4, we can make the following observations:

First, cross-dataset fine-tuning generally improves recommendation performance, but it may also introduce negative effects on the target dataset in some cases (as indicated by the red-highlighted results in the table). Notably, the Yelp and Hotel. datasets exhibit a higher likelihood of such degradation, possibly due to domain gaps and mismatches between the test sets and finetune sets. Moreover, for the Llama-3 ${}_{\text{8B}}$ model, Micro. and Movie. also demonstrate performance degradation under cross-dataset finetuning. Interestingly, these two datasets are where Llama-3 ${}_{\text{8B}}$ achieves the highest zero-shot performance among the 10 test sets. This suggests that Llama-3 ${}_{\text{8B}}$ likely encountered collaborative signals related to these domains during pretraining, allowing it to effectively capture user interests for video-based recommendations even without additional tuning.

Second, single-domain cross-dataset finetuning is not always more effective than cross-domain finetuning. While it intuitively makes sense that user interests are easier to model within the same domain, as supported by results in news (MIND–PENS) and books (Good.–Books), this trend does not hold for movies, music, and fashion: their results did not even rank in the top five. A possible reason is that MIND and PENS both originate from Microsoft, and Amazon is the source of Books as well as the parent company of Goodreads.com, suggesting these dataset pairs may share more similar distributions.

Third, dataset quality varies, but its effectiveness also depends heavily on the capacity of the pretrained model. For instance, finetuning on Good. with BERT ${}_{\text{base}}$ ranks only fifth, while using Llama-3 ${}_{\text{8B}}$ lifts it to first. This may be because Goodreads relies on book titles as content features, which are poorly represented in smaller models’ pretraining corpora. In contrast, Llama-3 ${}_{\text{8B}}$ better understands textual content, leading to more robust item representations and improved user modeling. According to the Top-5 RRA results, CDs and H&M offer the strongest transferability, while POG is the weakest. Additionally, Good. and Last.fm show large performance gains when switching from BERT ${}_{\text{base}}$ to Llama-3 ${}_{\text{8B}}$, suggesting complex content features paired with highly transferable user interests. On the other hand, MIND and Micro. show ranking drops, indicating their simpler content may already be sufficiently modeled by smaller models, but their user behavior patterns are less suitable for cross-dataset transfer. Finally, although Books and Last.fm do not have corresponding test sets, their finetuned models still rank in the top five under Llama-3 ${}_{\text{8B}}$, suggesting strong generalization capability across domains.

5.4 Multi-domain Fine-tuning: vs.

We investigate the impact of multi-domain fine-tuning, focusing on two key settings: multi-domain cross-dataset and multi-domain multi-dataset. In the multi-domain cross-dataset setting, foundation models are fine-tuned using datasets from domains different from the test sets, specifically: POG, PENS, Netflix, Books, and Last.fm. In contrast, the multi-domain multi-dataset setting involves fine-tuning on datasets that share domains with the test sets but do not include overlapping data, namely: H&M, MIND, Micro., Good., and CDs. From the results illustrated in Table 5, we can make the following observations:

First, the multi-domain multi-dataset setting achieves the best performance on H&M, MIND, Micro., Good., and CDs across all three foundation models, as it is fine-tuned directly on these datasets and thus captures domain-specific knowledge effectively. Second, although the multi-domain cross-dataset setting uses different datasets for fine-tuning, it consistently outperforms the zero-shot setting, highlighting the generalization benefits of multi-domain training with diverse user behavior patterns. Third, while the multi-dataset setting serves as an upper bound, the performance gap between the cross-dataset and multi-dataset settings narrows with larger models. For instance, the improvement on H&M drops from 30.0% (BERT ${}_{\text{base}}$) to 16.8% (Llama-3 ${}_{\text{8B}}$), suggesting that large models fine-tuned on cross-domain data can better handle zero-resource scenarios. Finally, due to its reliance on specific data distributions, the multi-dataset setting underperforms on five other datasets (Movie., Yelp, Steam, Elec., Hotel.), underscoring the fairness and robustness of the multi-domain cross-dataset setting.
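The narrowing gap on H&M can be recomputed from the Table 5 AUC values. The sketch below assumes that, within each model block of Table 5, the second row is the multi-domain cross-dataset setting and the third row is the multi-domain multi-dataset setting, and that the gap is measured relative to the cross-dataset AUC; both are inferences from the text, and the helper `relative_gap` is illustrative. Under these assumptions the Llama-3 figure comes out at about 16.8% and the BERT figure at about 29.7%, close to but not exactly the quoted 30.0%, so the authors may have used a slightly different rounding or baseline.

```python
# AUC on the H&M test set, copied from Table 5. Mapping the two fine-tuned
# rows to "cross_dataset" and "multi_dataset" is an assumption (see lead-in).
h_and_m_auc = {
    "BERT-base": {"cross_dataset": 0.6607, "multi_dataset": 0.8569},
    "Llama-3-8B": {"cross_dataset": 0.7295, "multi_dataset": 0.8524},
}


def relative_gap(cross: float, multi: float) -> float:
    """Improvement of multi-dataset over cross-dataset fine-tuning, in percent."""
    return (multi - cross) / cross * 100.0


for model, auc in h_and_m_auc.items():
    gap = relative_gap(auc["cross_dataset"], auc["multi_dataset"])
    print(f"{model}: {gap:.1f}% relative gap on H&M")
```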

6 Conclusion

We have introduced RecBench-MD, a novel and comprehensive benchmark designed to evaluate the recommendation capabilities of foundation models across a wide range of datasets and domains. Our thorough analysis of 19 foundation models across 15 datasets and 10 domains provides crucial insights into their performance in recommendation tasks. The findings demonstrate the substantial advantages of cross-dataset transfer learning and multi-domain training in improving the adaptability of foundation models. We expect that these insights, along with the valuable resources provided, will drive future advancements in the development of recommendation foundation models, offering a strong foundation for continued research and innovation in this field.

Appendix A Limitations

In this study, we assess the recommendation capabilities of foundation models on two of the most prevalent tasks: prompt-based approaches (similar to CTR models) and embedding-based approaches (akin to matching models), from a multi-dataset, multi-domain perspective. Nonetheless, our current evaluation does not encompass sequential recommendation, which represents a crucial area for future development and enhancement.

Appendix B Broader Impacts

Our benchmark offers a comprehensive and scalable framework for evaluating foundation models in zero-resource, multi-dataset, multi-domain recommendation scenarios, thereby promoting more systematic and reproducible research. It establishes a solid foundation for ongoing research and innovation in this field. Furthermore, the benchmark facilitates cross-domain fine-tuning, extending its benefits to other areas such as natural language processing.

Appendix C Technical Appendices

C.1 Impact of Fine-tuning Dataset Order In Multi-domain Recommendation

Table 5 compares three evaluation strategies: zero-shot, multi-domain cross-dataset, and multi-domain multi-dataset. Both multi-domain strategies involve training on a mixture of all available fine-tune sets. Here, in contrast, we analyze a sequential fine-tuning strategy, focusing on how to select datasets and determine the fine-tuning order. Given the combinatorial complexity of ordering all five datasets ($A_{5}^{5}$), we restrict our analysis to pairs of fine-tune sets. Based on the results in Table 6, we observe that:

First, in most cases, the value at row $i$, column $j$ exceeds the corresponding single-dataset fine-tuning result at row $j$, column $j$, indicating that using two datasets generally provides greater benefit than using only one. However, it does not necessarily surpass the value at row $i$, column $i$, since knowledge learned from dataset $i$ in the first step may be subject to catastrophic forgetting during continual fine-tuning on dataset $j$.

Second, building on this observation, the dataset used in the later stage (second step) of fine-tuning tends to have a dominant influence on the final performance. For example, in columns corresponding to datasets that achieved the best single-dataset results, green highlights are commonly observed (e.g., the PENS column when the test set is H&M, or the Books column when the test set is Good.). To further investigate this effect, we present a more detailed analysis in Figure 3, showing that datasets with stronger single-dataset performance are generally more effective when used in the second fine-tuning step.

Third, we further investigate which fine-tuning combinations are most likely to yield Top-5 performance. We hypothesize that this is related to the single-dataset performance of the fine-tune sets. To examine this, we present Figure 4. The results suggest that using a lower-ranked dataset in the first step, followed by the top-performing dataset in the second step, tends to produce the best outcomes for the target test set.
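The first observation above can be checked directly on one block of Table 6. The sketch below uses the H&M block and assumes that rows index the first-step dataset and columns the second-step dataset, which matches the pair listings in the additional-metrics tables (for example, POG followed by PENS on MIND gives 0.6687 in both places). It also counts how many orderings exist for five datasets versus ordered pairs. Variable names are illustrative.

```python
import math

datasets = ["POG", "PENS", "Netflix", "Books", "Last.fm"]

# H&M block of Table 6: auc[i][j] = fine-tune on datasets[i], then datasets[j].
# Diagonal entries are the single-dataset results.
auc = [
    [0.5922, 0.7287, 0.6722, 0.5824, 0.7496],
    [0.6827, 0.7517, 0.7338, 0.6036, 0.7355],
    [0.7099, 0.7439, 0.6655, 0.6442, 0.7502],
    [0.7356, 0.7450, 0.6923, 0.5750, 0.7444],
    [0.7367, 0.7592, 0.7143, 0.6096, 0.7444],
]

n = len(datasets)
print("orderings of all five datasets:", math.perm(n, n))  # A_5^5
print("ordered pairs of datasets:", math.perm(n, 2))

pairs = [(i, j) for i in range(n) for j in range(n) if i != j]

# Pair (i -> j) versus using only the second-step dataset j.
beats_second_alone = sum(auc[i][j] > auc[j][j] for i, j in pairs)
# Pair (i -> j) versus using only the first-step dataset i.
beats_first_alone = sum(auc[i][j] > auc[i][i] for i, j in pairs)

print(f"pair beats second-step dataset alone: {beats_second_alone}/{len(pairs)}")
print(f"pair beats first-step dataset alone:  {beats_first_alone}/{len(pairs)}")
```

On this block, most ordered pairs improve on the second-step dataset used alone, while fewer improve on the first-step dataset used alone, which is the pattern one would expect if first-step knowledge is partly forgotten.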

C.2 Additional Evaluation Metrics

In the main text, we report only the AUC metric due to space constraints. Here, we provide additional evaluation metrics, including nDCG@1, nDCG@5, MRR, Recall@1, and Recall@5, for a more comprehensive comparison.

As shown in Table 7, Table 8, and Table 9, other metrics generally align with the AUC results, supporting the consistency of our findings. We will release the complete experimental results on our website.
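For readers unfamiliar with the additional metrics, the sketch below computes them for a single user's candidate list. It assumes the common binary-relevance definitions (labels already sorted by descending model score, logarithmic discount for nDCG, reciprocal rank of the first positive for MRR); the benchmark's exact implementation may differ, and the function names are illustrative.

```python
import math
from typing import Sequence


def dcg(labels: Sequence[int]) -> float:
    """Discounted cumulative gain for labels in ranked order."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(labels))


def ndcg_at_k(labels: Sequence[int], k: int) -> float:
    """DCG of the top-k list divided by the DCG of an ideal top-k list."""
    ideal = dcg(sorted(labels, reverse=True)[:k])
    return dcg(labels[:k]) / ideal if ideal > 0 else 0.0


def mrr(labels: Sequence[int]) -> float:
    """Reciprocal rank of the first positive item, or 0 if there is none."""
    for pos, rel in enumerate(labels):
        if rel:
            return 1.0 / (pos + 1)
    return 0.0


def recall_at_k(labels: Sequence[int], k: int) -> float:
    """Fraction of all positive items that appear in the top k."""
    total = sum(labels)
    return sum(labels[:k]) / total if total else 0.0


# One user's candidates, sorted by model score; 1 marks an interacted item.
ranked_labels = [0, 1, 0, 1, 0]

print("nDCG@1  :", round(ndcg_at_k(ranked_labels, 1), 4))
print("nDCG@5  :", round(ndcg_at_k(ranked_labels, 5), 4))
print("MRR     :", round(mrr(ranked_labels), 4))
print("Recall@1:", round(recall_at_k(ranked_labels, 1), 4))
print("Recall@5:", round(recall_at_k(ranked_labels, 5), 4))
```

Dataset-level values would then be averages of these per-user scores. With more than one positive per user, Recall@1 is bounded well below 1, which is consistent with Recall@1 being much lower than nDCG@1 in the tables.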

Bibliography

1. Bao et al. [2023a] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, pages 1007–1014, 2023a.
2. Bao et al. [2023b] Keqin Bao, Jizhi Zhang, Yang Zhang, Wang Wenjie, Fuli Feng, and Xiangnan He. Large language models for recommendation: Progresses and future directions. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 306–309, 2023b.
3. Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
4. Chen et al. [2022] Chaochao Chen, Huiwen Wu, Jiajie Su, Lingjuan Lyu, Xiaolin Zheng, and Li Wang. Differential private knowledge transfer for privacy-preserving cross-domain recommendation. In Proceedings of the ACM web conference 2022, pages 1455–1465, 2022.
5. Chen et al. [2019] Chong Chen, Min Zhang, Chenyang Wang, Weizhi Ma, Minming Li, Yiqun Liu, and Shaoping Ma. An efficient adaptive transfer neural network for social-aware recommendation. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pages 225–234, 2019.
6. Chen et al. [2024] Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web, 27(4):42, 2024.
7. Chen et al. [2013] Wei Chen, Wynne Hsu, and Mong Li Lee. Making recommendations from multiple domains. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 892–900, 2013.
8. Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.

Dataset	Domain	Symbol	Test set #Sample	Test set #Item	Test set #User	Finetune set #Sample	Finetune set #Item	Finetune set #User	Used Attributes
H&M	Fashion	H&M	20,000	15,305	5,000	100,000	50,319	25,000	detail_desc
MIND	News	MIND	20,006	3,088	1,514	100,000	5,481	7,606	title
MicroLens	Video	Micro.	20,000	11,073	5,000	100,000	18,658	25,000	title
Goodreads	Book	Good.	20,009	12,984	1,736	100,005	40,322	8,604	original_title
Amazon CDs	Music	CDs	20,003	15,568	4,930	100,003	55,428	24,618	title
MovieLens	Movie	Movie.	20,008	4,300	2,251	-	-	-	title
Yelp	Restaurant	Yelp	20,003	15,239	4,013	-	-	-	name
Steam	Game	Steam	20,000	2,216	5,000	-	-	-	game_name
Amazon Electronics	E-commerce	Elec.	20,002	11,045	5,431	-	-	-	title
HotelRec	Hotel	Hotel.	20,002	17,295	5,437	-	-	-	name, location
POG	Fashion	POG	-	-	-	100,002	15,846	15,734	title_en
PENS	News	PENS	-	-	-	100,007	9,053	8,542	title
Netflix	Video	Netflix	-	-	-	100,010	3,645	13,424	title
Amazon Books	Book	Books	-	-	-	100,002	28,471	25,139	title
LastFM	Music	Last.fm	-	-	-	100,100	94,319	910	track, artist

Finetune set	H&M	MIND	Micro.	Good.	CDs	Movie.	Yelp	Steam	Elec.	Hotel.	RRA@5
Foundation Model: BERT $_{base}$
N/A	0.5204	0.4963	0.4992	0.4958	0.5059	0.4934	0.4914	0.5002	0.5037	0.4955	-
H&M	0.8701 (1)	0.5496 (4)	0.5692 (3)	0.5282 (1)	0.5103 (3)	0.5127 (4)	0.4961	0.7291 (3)	0.5304 (3)	0.4869	0.3833 (1)
MIND	0.6750 (3)	0.7118 (1)	0.5877 (2)	0.5255 (4)	0.5128 (2)	0.4932	0.5024 (5)	0.7184 (4)	0.5306 (2)	0.4847	0.3533 (3)
Micro.	0.6661 (4)	0.5841 (3)	0.8148 (1)	0.5097	0.5093 (5)	0.5150 (3)	0.4864	0.7393 (1)	0.5004	0.4807	0.3117 (4)
Good.	0.6218	0.5081	0.5239	0.5208 (5)	0.4992	0.4957	0.5105 (4)	0.6220	0.5168	0.4952 (4)	0.0700 (8)
CDs	0.6464 (5)	0.5053	0.5152	0.5139	0.6185 (1)	0.5503 (2)	0.5356 (1)	0.4794	0.5076	0.5216 (1)	0.3700 (2)
POG	0.6222	0.5153	0.5470	0.5138	0.4989	0.4953	0.4913	0.6171	0.5291 (4)	0.4914 (5)	0.0450 (10)
PENS	0.6872 (2)	0.6203 (2)	0.5554 (5)	0.5165	0.5051	0.5069	0.4987	0.7311 (2)	0.5218	0.4900	0.1700 (7)
Netflix	0.6191	0.5396 (5)	0.5328	0.5080	0.5097 (4)	0.5656 (1)	0.5117 (3)	0.6954 (5)	0.5255 (5)	0.5077 (2)	0.2683 (5)
Books	0.6108	0.5108	0.5295	0.5264 (2)	0.5089	0.5119 (5)	0.5155 (2)	0.6191	0.5313 (1)	0.4957 (3)	0.2533 (6)
Last.fm	0.6279	0.5127	0.5645 (4)	0.5263 (3)	0.5023	0.4855	0.4773	0.6699	0.5231	0.4769	0.0583 (9)
Foundation Model: Llama-3 $_{8B}$
N/A	0.5267	0.4904	0.6412	0.5577	0.5191	0.7690	0.5136	0.5454	0.5223	0.5342	-
H&M	0.8606 (1)	0.5693 (5)	0.6758 (2)	0.6116 (2)	0.5268 (5)	0.6818 (4)	0.4911	0.9193 (2)	0.5469 (4)	0.5116	0.3400 (2)
MIND	0.6599	0.7120 (1)	0.6104 (5)	0.5279	0.5176	0.5235	0.4808	0.8033	0.5284	0.5008	0.1200 (9)
Micro.	0.7504 (3)	0.6331 (3)	0.8295 (1)	0.5829	0.5239	0.6703	0.5165	0.8953 (4)	0.5250	0.4900	0.1917 (7)
Good.	0.6592	0.5249	0.5731	0.6799 (1)	0.5334 (4)	0.6550	0.5301 (3)	0.8795 (5)	0.6051 (3)	0.5320 (4)	0.2367 (5)
CDs	0.6262	0.5019	0.5385	0.5840 (5)	0.6267 (1)	0.7201 (2)	0.5939 (1)	0.6427	0.6410 (1)	0.6057 (3)	0.4033 (1)
POG	0.5922	0.4912	0.5175	0.5068	0.5011	0.5354	0.5147 (4)	0.5684	0.4896	0.5142 (5)	0.0450 (10)
PENS	0.7517 (2)	0.6823 (2)	0.6675 (3)	0.5670	0.5191	0.6311	0.5134 (5)	0.8979 (3)	0.5353	0.4860	0.1867 (8)
Netflix	0.6655 (5)	0.4844	0.5634	0.5694	0.5355 (3)	0.7422 (1)	0.4958	0.8547	0.5947 (3)	0.6104 (2)	0.2367 (5)
Books	0.5750	0.5040	0.5322	0.5876 (3)	0.5600 (2)	0.6935 (3)	0.5727 (2)	0.6373	0.6267 (2)	0.6149 (1)	0.2833 (3)
Last.fm	0.7444 (4)	0.5807 (4)	0.6567 (4)	0.5871 (4)	0.5045	0.6736 (5)	0.4818	0.9363 (1)	0.5414 (5)	0.5125	0.2400 (4)

Model	Setting	H&M	MIND	Micro.	Good.	CDs	Movie.	Yelp	Steam	Elec.	Hotel.
BERT $_{base}$	Zero-shot	0.5204	0.4963	0.4992	0.4958	0.5059	0.4934	0.4914	0.5002	0.5037	0.4955
BERT $_{base}$	Multi-domain cross-dataset	0.6607	0.6004	0.5711	0.5213	0.5092	0.5675	0.5218	0.6189	0.5091	0.4795
BERT $_{base}$	Multi-domain multi-dataset	0.8569	0.7004	0.8019	0.5648	0.5653	0.5106	0.5093	0.6817	0.5039	0.4869
OPT $_{1B}$	Zero-shot	0.5650	0.5338	0.5236	0.5042	0.4994	0.5174	0.5026	0.3825	0.5205	0.5026
OPT $_{1B}$	Multi-domain cross-dataset	0.7002	0.5996	0.6165	0.5189	0.5181	0.6156	0.4853	0.7665	0.5446	0.5037
OPT $_{1B}$	Multi-domain multi-dataset	0.8658	0.7259	0.8132	0.5374	0.6220	0.5813	0.5014	0.6959	0.5248	0.4951
Llama-3 $_{8B}$	Zero-shot	0.7690	0.4904	0.6412	0.5577	0.5191	0.5267	0.5136	0.5454	0.5223	0.5342
Llama-3 $_{8B}$	Multi-domain cross-dataset	0.7295	0.6732	0.6223	0.5864	0.5626	0.7203	0.5764	0.7828	0.6296	0.5806
Llama-3 $_{8B}$	Multi-domain multi-dataset	0.8524	0.7206	0.8235	0.6660	0.6281	0.6683	0.5846	0.7042	0.6139	0.5318

	H&M					MIND
	POG (4)	PENS (1)	Netflix (3)	Books (5)	Last.fm (2)	POG (4)	PENS (1)	Netflix (5)	Books (3)	Last.fm (2)
POG	0.5922	0.7287	0.6722	0.5824	0.7496 (4)	0.4912	0.6687 (5)	0.5221	0.4863	0.5901
PENS	0.6827	0.7517 (2)	0.7338	0.6036	0.7355	0.6636	0.6823 (1)	0.6186	0.5270	0.6482
Netflix	0.7099	0.7439	0.6655	0.6442	0.7502 (3)	0.5530	0.6757 (3)	0.4844	0.4992	0.5866
Books	0.7356	0.7450 (5)	0.6923	0.5750	0.7444	0.5248	0.6759 (2)	0.5223	0.5040	0.5785
Last.fm	0.7367	0.7592 (1)	0.7143	0.6096	0.7444	0.5440	0.6738 (4)	0.5039	0.5149	0.5807
	Micro.					Good.
	POG (5)	PENS (1)	Netflix (3)	Books (4)	Last.fm (2)	POG (5)	PENS (4)	Netflix (3)	Books (1)	Last.fm (2)
POG	0.5175	0.6443	0.5753	0.5304	0.6495	0.5068	0.5659	0.5631	0.5700	0.5858
PENS	0.6262	0.6675	0.6157	0.5508	0.6439	0.5632	0.5670	0.5810	0.6024 (1)	0.5375
Netflix	0.6222	0.6693 (2)	0.5634	0.5650	0.6688 (3)	0.5690	0.5721	0.5694	0.5973 (2)	0.5743
Books	0.6328	0.6662 (4)	0.5784	0.5322	0.6462	0.5822	0.5947 (5)	0.5950 (4)	0.5876	0.5864
Last.fm	0.6363	0.6779 (1)	0.5875	0.5338	0.6567 (5)	0.5540	0.5728	0.5748	0.5953 (3)	0.5871
	CDs					Overall
	POG (5)	PENS (3)	Netflix (2)	Books (1)	Last.fm (4)	POG (5)	PENS (1)	Netflix (3)	Books (4)	Last.fm (2)
POG	0.5011	0.5225	0.5254	0.5517	0.4947	0.5218	0.6260 (5)	0.5716	0.5442	0.6139
PENS	0.5140	0.5191	0.5366	0.5551 (5)	0.5023	0.6099	0.6375 (4)	0.6171	0.5678	0.6135
Netflix	0.5237	0.5293	0.5355	0.5602 (2)	0.5098	0.5956	0.6381 (3)	0.5636	0.5732	0.6179
Books	0.5563 (4)	0.5340	0.5387	0.5600 (3)	0.5039	0.6063	0.6432 (1)	0.5853	0.5518	0.6119
Last.fm	0.5098	0.5263	0.5259	0.5654 (1)	0.5045	0.5962	0.6420 (2)	0.5813	0.5638	0.6147
	Movie.					Yelp
	POG (5)	PENS (4)	Netflix (1)	Books (2)	Last.fm (3)	POG (2)	PENS (3)	Netflix (4)	Books (1)	Last.fm (5)
POG	0.5354	0.6216	0.7460 (3)	0.6649	0.6702	0.5147	0.4937	0.5192	0.5660 (5)	0.4927
PENS	0.5890	0.6311	0.7486 (2)	0.6929	0.5829	0.5295	0.5134	0.5186	0.5859 (1)	0.5077
Netflix	0.7054	0.7012	0.7422 (5)	0.7144	0.7110	0.5130	0.5138	0.4958	0.5725 (4)	0.4825
Books	0.7061	0.6832	0.7458 (4)	0.6935	0.6843	0.5597	0.5249	0.5062	0.5727 (3)	0.5089
Last.fm	0.6406	0.6688	0.7487 (1)	0.6737	0.6736	0.5133	0.5218	0.4950	0.5754 (2)	0.4818
	Steam					Elec.
	POG (5)	PENS (2)	Netflix (3)	Books (4)	Last.fm (1)	POG (5)	PENS (4)	Netflix (2)	Books (1)	Last.fm (3)
POG	0.5684	0.8902	0.8679	0.6110	0.9299 (3)	0.4896	0.5255	0.5720	0.6274 (4)	0.5340
PENS	0.8452	0.8979	0.8781	0.7137	0.8827	0.5426	0.5353	0.5885	0.6380 (2)	0.5211
Netflix	0.8826	0.9076	0.8547	0.7431	0.9297 (4)	0.5559	0.5468	0.5947	0.6419 (1)	0.5462
Books	0.7833	0.9043	0.8674	0.6373	0.9307 (2)	0.5741	0.5564	0.6078	0.6267 (5)	0.5506
Last.fm	0.8633	0.9191 (5)	0.8899	0.6743	0.9363 (1)	0.5445	0.5290	0.5918	0.6294 (3)	0.5414
	Hotel.					Overall
	POG (3)	PENS (5)	Netflix (2)	Books (1)	Last.fm (4)	POG (5)	PENS (4)	Netflix (1)	Books (3)	Last.fm (2)
POG	0.5142	0.4903	0.5947	0.6004	0.4990	0.5245	0.6043	0.6600	0.6139	0.6252
PENS	0.5006	0.4860	0.5892	0.6156 (2)	0.5040	0.6092	0.6127	0.6646 (2)	0.6492	0.5997
Netflix	0.5399	0.5188	0.6104 (4)	0.6204 (1)	0.5189	0.6394	0.6376	0.6596 (4)	0.6585	0.6377
Books	0.5622	0.5033	0.6006	0.6149 (3)	0.5216	0.6371	0.6344	0.6656 (1)	0.6290	0.6392 (5)
Last.fm	0.4921	0.4875	0.5903	0.6055 (5)	0.5125	0.6108	0.6252	0.6631 (3)	0.6317	0.6291

First Step	Second Step	AUC	nDCG@1	nDCG@5	MRR	Recall@1	Recall@5
MIND
POG	PENS	0.6687	0.5179	0.5767	0.5687	0.1865	0.5808
PENS	POG	0.6636	0.4905	0.5653	0.5513	0.1653	0.5643
POG	Netflix	0.5221	0.3234	0.4072	0.4333	0.1100	0.4249
Netflix	POG	0.5530	0.3461	0.4401	0.4607	0.1232	0.4711
POG	Books	0.4863	0.2675	0.3636	0.4009	0.0866	0.3944
Books	POG	0.5248	0.3154	0.4090	0.4330	0.1001	0.4365
POG	Last.fm	0.5901	0.4206	0.4918	0.5019	0.1514	0.5071
Last.fm	POG	0.5440	0.3506	0.4377	0.4562	0.1170	0.4644
PENS	Books	0.5270	0.2999	0.3987	0.4309	0.1042	0.4360
Books	PENS	0.6759	0.5342	0.5811	0.5752	0.1903	0.5777
PENS	Netflix	0.6186	0.4205	0.5067	0.5110	0.1451	0.5228
Netflix	PENS	0.6757	0.5208	0.5813	0.5743	0.1847	0.5886
PENS	Last.fm	0.6482	0.4630	0.5440	0.5319	0.1539	0.5597
Last.fm	PENS	0.6738	0.5182	0.5777	0.5719	0.1813	0.5814
Books	Netflix	0.5223	0.2990	0.4026	0.4289	0.0973	0.4326
Netflix	Books	0.4992	0.2688	0.3774	0.4150	0.0904	0.4128
Books	Last.fm	0.5785	0.3932	0.4759	0.4878	0.1373	0.5025
Last.fm	Books	0.5149	0.3005	0.3941	0.4294	0.1057	0.4276
Netflix	Last.fm	0.5866	0.4026	0.4821	0.4958	0.1428	0.5032
Last.fm	Netflix	0.5039	0.3012	0.3857	0.4249	0.1013	0.4178
Micro.
POG	PENS	0.6443	0.6782	0.6782	0.7648	0.3301	1.0000
PENS	POG	0.6262	0.6638	0.8450	0.7433	0.3107	1.0000
POG	Netflix	0.5753	0.5945	0.8201	0.7204	0.2873	1.0000
Netflix	POG	0.6222	0.6665	0.8441	0.7554	0.3292	1.0000
POG	Books	0.5304	0.5257	0.7972	0.6889	0.2511	1.0000
Books	POG	0.6328	0.6635	0.8463	0.7607	0.3276	1.0000
POG	Last.fm	0.6495	0.6867	0.6867	0.7749	0.3423	1.0000
Last.fm	POG	0.6363	0.6705	0.8484	0.7622	0.3307	1.0000
PENS	Books	0.5508	0.5653	0.8096	0.7094	0.2780	1.0000
Books	PENS	0.6662	0.7094	0.8628	0.7828	0.3496	1.0000
PENS	Netflix	0.6157	0.6423	0.8387	0.7473	0.3137	1.0000
Netflix	PENS	0.6693	0.7028	0.8624	0.7850	0.3483	1.0000
PENS	Last.fm	0.6439	0.6742	0.8511	0.7619	0.3258	1.0000
Last.fm	PENS	0.6779	0.7162	0.8671	0.7921	0.3550	1.0000
Books	Netflix	0.5784	0.5939	0.8211	0.7248	0.2913	1.0000
Netflix	Books	0.5650	0.5841	0.8161	0.7192	0.2868	1.0000
Books	Last.fm	0.6462	0.6879	0.8540	0.7722	0.3413	1.0000
Last.fm	Books	0.5338	0.5435	0.8011	0.6999	0.2687	1.0000
Netflix	Last.fm	0.6688	0.7104	0.8636	0.7873	0.3534	1.0000
Last.fm	Netflix	0.5875	0.6080	0.8251	0.7334	0.3019	1.0000

First Step	Second Step	AUC	nDCG@1	nDCG@5	MRR	Recall@1	Recall@5
Movie.
POG	PENS	0.6216	0.4486	0.5870	0.5467	0.2146	0.7081
PENS	POG	0.5890	0.3970	0.5600	0.5149	0.1873	0.6875
POG	Netflix	0.7460	0.5643	0.7022	0.6516	0.2757	0.8207
Netflix	POG	0.7054	0.5360	0.6671	0.6224	0.2648	0.7896
POG	Books	0.6649	0.4753	0.6223	0.5757	0.2297	0.7523
Books	POG	0.7061	0.5277	0.6685	0.6181	0.2592	0.7952
POG	Last.fm	0.6702	0.5270	0.6434	0.6072	0.2610	0.7577
Last.fm	POG	0.6406	0.4749	0.6085	0.5677	0.2337	0.7323
PENS	Books	0.6929	0.5158	0.6487	0.6072	0.2552	0.7728
Books	PENS	0.6832	0.5330	0.6531	0.6079	0.2594	0.7668
PENS	Netflix	0.7486	0.5926	0.7075	0.6648	0.2934	0.8187
Netflix	PENS	0.7012	0.5546	0.6680	0.6262	0.2732	0.7790
PENS	Last.fm	0.5829	0.3782	0.5432	0.4812	0.1570	0.6479
Last.fm	PENS	0.6688	0.5217	0.6417	0.6024	0.2559	0.7530
Books	Netflix	0.7458	0.5815	0.7041	0.6576	0.2861	0.8183
Netflix	Books	0.7144	0.5544	0.6732	0.6279	0.2723	0.7901
Books	Last.fm	0.6843	0.5417	0.6553	0.6196	0.2690	0.7688
Last.fm	Books	0.6737	0.4782	0.6307	0.5871	0.2357	0.7659
Netflix	Last.fm	0.7110	0.5660	0.6785	0.6417	0.2823	0.7916
Last.fm	Netflix	0.7487	0.5902	0.7092	0.6657	0.2934	0.8239
Yelp
POG	PENS	0.4937	0.4761	0.7172	0.6378	0.2227	0.8871
PENS	POG	0.5295	0.5086	0.7383	0.6575	0.2385	0.9011
POG	Netflix	0.5192	0.5164	0.7356	0.6576	0.2461	0.8985
Netflix	POG	0.5130	0.4964	0.7289	0.6572	0.2445	0.8998
POG	Books	0.5660	0.5535	0.7599	0.6842	0.2684	0.9166
Books	POG	0.5597	0.5421	0.7553	0.6813	0.2645	0.9152
POG	Last.fm	0.4927	0.4765	0.7168	0.6465	0.2360	0.8917
Last.fm	POG	0.5133	0.5043	0.7281	0.6559	0.2451	0.8938
PENS	Books	0.5859	0.5732	0.7695	0.6981	0.2815	0.9208
Books	PENS	0.5249	0.5097	0.7358	0.6605	0.2461	0.9022
PENS	Netflix	0.5186	0.5078	0.7340	0.6587	0.2476	0.9009
Netflix	PENS	0.5138	0.5029	0.7303	0.6552	0.2437	0.9002
PENS	Last.fm	0.5077	0.4995	0.7262	0.6482	0.2365	0.8930
Last.fm	PENS	0.5218	0.5019	0.7321	0.6595	0.2441	0.9003
Books	Netflix	0.5062	0.5105	0.7294	0.6543	0.2467	0.8953
Netflix	Books	0.5725	0.5635	0.7640	0.6909	0.2752	0.9175
Books	Last.fm	0.5089	0.4963	0.7266	0.6568	0.2453	0.8972

First Step	Second Step	AUC	nDCG@1	nDCG@5	MRR	Recall@1	Recall@5
Good.
POG	PENS	0.5659	0.2801	0.4068	0.4011	0.1319	0.5098
PENS	POG	0.5632	0.2638	0.4017	0.3902	0.1224	0.5006
POG	Netflix	0.5631	0.2773	0.4025	0.3992	0.1322	0.5060
Netflix	POG	0.5690	0.2705	0.4091	0.4026	0.1316	0.5236
POG	Books	0.5700	0.2913	0.4137	0.4081	0.1391	0.5173
Books	POG	0.5822	0.3072	0.4315	0.4214	0.1483	0.5418
POG	Last.fm	0.5858	0.2866	0.4311	0.4247	0.1420	0.5541
Last.fm	POG	0.5540	0.2752	0.3944	0.3972	0.1339	0.4980
PENS	Books	0.6024	0.3263	0.4463	0.4380	0.1604	0.5559
Books	PENS	0.5947	0.3389	0.4488	0.4387	0.1619	0.5490
PENS	Netflix	0.5810	0.3013	0.4289	0.4209	0.1475	0.5383
Netflix	PENS	0.5721	0.2969	0.4172	0.4155	0.1434	0.5228
PENS	Last.fm	0.5375	0.2373	0.3741	0.3605	0.0994	0.4657
Last.fm	PENS	0.5728	0.2925	0.4182	0.4143	0.1408	0.5236
Books	Netflix	0.5950	0.3267	0.4455	0.4354	0.1596	0.5504
Netflix	Books	0.5973	0.3559	0.4529	0.4440	0.1743	0.5487
Books	Last.fm	0.5864	0.3162	0.4433	0.4342	0.1567	0.5616
Last.fm	Books	0.5953	0.3181	0.4452	0.4346	0.1552	0.5588
Netflix	Last.fm	0.5743	0.2949	0.4227	0.4212	0.1463	0.5397
Last.fm	Netflix	0.5748	0.3099	0.4234	0.4224	0.1526	0.5294
CDs
POG	PENS	0.5225	0.5763	0.7969	0.7180	0.2760	0.9523
PENS	POG	0.5140	0.5798	0.7964	0.7105	0.2706	0.9515
POG	Netflix	0.5254	0.5911	0.8011	0.7250	0.2865	0.9537
Netflix	POG	0.5237	0.5805	0.7978	0.7238	0.2861	0.9544
POG	Books	0.5517	0.6209	0.8137	0.7394	0.3027	0.9584
Books	POG	0.5563	0.6182	0.8126	0.7418	0.3030	0.9557
POG	Last.fm	0.4947	0.5583	0.7864	0.7129	0.2763	0.9495
Last.fm	POG	0.5098	0.5785	0.7941	0.7184	0.2840	0.9515
PENS	Books	0.5551	0.6242	0.8156	0.7443	0.3078	0.9591
Books	PENS	0.5340	0.5878	0.8029	0.7276	0.2854	0.9553
PENS	Netflix	0.5366	0.5969	0.8038	0.7326	0.2941	0.9534
Netflix	PENS	0.5293	0.5829	0.8001	0.7259	0.2845	0.9548
PENS	Last.fm	0.5023	0.5619	0.7910	0.7171	0.2756	0.9511
Last.fm	PENS	0.5263	0.5866	0.7974	0.7258	0.2879	0.9497
Books	Netflix	0.5387	0.6029	0.8064	0.7337	0.2948	0.9555
Netflix	Books	0.5602	0.6296	0.8170	0.7453	0.3100	0.9590
Books	Last.fm	0.5039	0.5656	0.7886	0.7174	0.2809	0.9477