Evaluating Recabilities of Foundation Models: A Multi-Domain, Multi-Dataset Benchmark
Qijiong Liu, Jieming Zhu, Yingxin Lai, Xiaoyu Dong, Lu Fan, Zhipeng Bian, Zhenhua Dong, Xiao-Ming Wu

TL;DR
This paper introduces RecBench-MD, a comprehensive benchmark for evaluating the recommendation capabilities of foundation models across multiple datasets and domains, highlighting the importance of fine-tuning and multi-domain training.
Contribution
The study presents RecBench-MD, a new benchmark for assessing foundation models' recommendation abilities across diverse datasets and domains, with extensive evaluations of 19 models.
Findings
In-domain fine-tuning yields the best performance.
Cross-dataset transfer learning supports new recommendation scenarios.
Multi-domain training improves model adaptability.
| Benchmark | Zhang et al. | OpenP5 | LLMRec | PromptRec | Jiang et al. | RSBench | RecBole-CDR | RecBench | RecBench-MD |
| Year | 2021 | 2024 | 2023 | 2024b | 2024 | 2024a | 2022 | 2025b | (ours) |
| Scale: #Foundation Models | 4 | 2 | 7 | 4 | 7 | 1 | 0 | 17 | 19 |
| Scale: #Datasets | 1 | 10 | 1 | 3 | 4 | 3 | 3 | 5 | 15 |
| Setting: Zero-shot | | | | | | | | | |
| Setting: Single-Dataset | – | | | | | | | | |
| Setting: In-domain Cross-dataset | – | | | | | | | | |
| Setting: In-domain Multi-dataset | – | | | | | | | | |
| Setting: Cross-domain | – | | | | | | | | |
| Setting: Multi-domain | – | | | | | | | | |
| Approach: Prompt-based | | | | | | | | | |
| Approach: Embedding-based | – | | | | | | | | |
| Metric: Quality | – | | | | | | | | |
| Metric: Efficiency | | | | | | | | | |
| Dataset | Domain | Test #Sample | Test #Item | Test #User | Finetune #Sample | Finetune #Item | Finetune #User | Used Attributes |
| H&M | Fashion | 20,000 | 15,305 | 5,000 | 100,000 | 50,319 | 25,000 | detail_desc |
| MIND | News | 20,006 | 3,088 | 1,514 | 100,000 | 5,481 | 7,606 | title |
| MicroLens | Video | 20,000 | 11,073 | 5,000 | 100,000 | 18,658 | 25,000 | title |
| Goodreads | Book | 20,009 | 12,984 | 1,736 | 100,005 | 40,322 | 8,604 | original_title |
| Amazon CDs | Music | 20,003 | 15,568 | 4,930 | 100,003 | 55,428 | 24,618 | title |
| MovieLens | Movie | 20,008 | 4,300 | 2,251 | - | - | - | title |
| Yelp | Restaurant | 20,003 | 15,239 | 4,013 | - | - | - | name |
| Steam | Game | 20,000 | 2,216 | 5,000 | - | - | - | game_name |
| Amazon Electronics | E-commerce | 20,002 | 11,045 | 5,431 | - | - | - | title |
| HotelRec | Hotel | 20,002 | 17,295 | 5,437 | - | - | - | name, location |
| POG | Fashion | - | - | - | 100,002 | 15,846 | 15,734 | title_en |
| PENS | News | - | - | - | 100,007 | 9,053 | 8,542 | title |
| Netflix | Video | - | - | - | 100,010 | 3,645 | 13,424 | title |
| Amazon Books | Book | - | - | - | 100,002 | 28,471 | 25,139 | title |
| LastFM | Music | - | - | - | 100,100 | 94,319 | 910 | track, artist |
| Model | Prompt | Embedding | Prompt | Embedding | Prompt | Embedding | Prompt | Embedding | Prompt | Embedding |
| Fine-tune set: N/A |
| BERT | 0.5204 | 0.5167 | 0.4963 | 0.5263 | 0.4992 | 0.5305 | 0.4958 | 0.5160 | 0.5059 | 0.5139 |
| OPT | 0.5650 | 0.6370 | 0.5338 | 0.5510 | 0.5236 | 0.5447 | 0.5042 | 0.5258 | 0.4994 | 0.5137 |
| Llama-3 | 0.5454 | 0.6487 | 0.4904 | 0.5666 | 0.5577 | 0.5218 | 0.5191 | 0.5150 | 0.5136 | 0.5162 |
| Mistral-2 | 0.7166 | 0.6051 | 0.6300 | 0.5607 | 0.6579 | 0.5329 | 0.5718 | 0.5240 | 0.5230 | 0.5198 |
| Qwen-2 | 0.7124 | 0.6201 | 0.5862 | 0.5347 | 0.6640 | 0.5391 | 0.5494 | 0.5190 | 0.5256 | 0.5212 |
| P5 | 0.7124 | 0.6201 | 0.4911 | 0.5948 | 0.5017 | 0.6423 | 0.5027 | 0.5186 | 0.5447 | 0.5218 |
| Fine-tune set |
| BERT | 0.8701 | 0.8359 | 0.7118 | 0.6837 | 0.8148 | 0.7671 | 0.5208 | 0.5738 | 0.6185 | 0.5909 |
| OPT | 0.5155 | 0.8163 | 0.7344 | 0.6788 | 0.7735 | 0.7545 | 0.4985 | 0.5754 | 0.5808 | 0.5713 |
| Llama-3 | 0.8606 | 0.8150 | 0.7120 | 0.6745 | 0.8295 | 0.7430 | 0.6799 | 0.5944 | 0.6267 | 0.5835 |
| Fine-tune set |
| BERT | 0.8009 | 0.8373 | 0.6834 | 0.6823 | 0.5480 | 0.7538 | 0.5169 | 0.5537 | 0.5790 | 0.5181 |
| OPT | 0.5208 | 0.7941 | 0.7350 | 0.6885 | 0.5800 | 0.7443 | 0.5966 | 0.5500 | 0.4926 | 0.5497 |
| Llama-3 | 0.5378 | 0.8095 | 0.7350 | 0.6765 | 0.8368 | 0.7557 | 0.6325 | 0.5638 | 0.6353 | 0.5484 |
| Fine-tune set | | | | | | | | | | | RRA@5 |
| Foundation Model: BERT |
| N/A | 0.5204 | 0.4963 | 0.4992 | 0.4958 | 0.5059 | 0.4934 | 0.4914 | 0.5002 | 0.5037 | 0.4955 | - |
| | 0.8701 (1) | 0.5496 (4) | 0.5692 (3) | 0.5282 (1) | 0.5103 (3) | 0.5127 (4) | 0.4961 | 0.7291 (3) | 0.5304 (3) | 0.4869 | 0.3833 (1) |
| | 0.6750 (3) | 0.7118 (1) | 0.5877 (2) | 0.5255 (4) | 0.5128 (2) | 0.4932 | 0.5024 (5) | 0.7184 (4) | 0.5306 (2) | 0.4847 | 0.3533 (3) |
| | 0.6661 (4) | 0.5841 (3) | 0.8148 (1) | 0.5097 | 0.5093 (5) | 0.5150 (3) | 0.4864 | 0.7393 (1) | 0.5004 | 0.4807 | 0.3117 (4) |
| | 0.6218 | 0.5081 | 0.5239 | 0.5208 (5) | 0.4992 | 0.4957 | 0.5105 (4) | 0.6220 | 0.5168 | 0.4952 (4) | 0.0700 (8) |
| | 0.6464 (5) | 0.5053 | 0.5152 | 0.5139 | 0.6185 (1) | 0.5503 (2) | 0.5356 (1) | 0.4794 | 0.5076 | 0.5216 (1) | 0.3700 (2) |
| | 0.6222 | 0.5153 | 0.5470 | 0.5138 | 0.4989 | 0.4953 | 0.4913 | 0.6171 | 0.5291 (4) | 0.4914 (5) | 0.0450 (10) |
| | 0.6872 (2) | 0.6203 (2) | 0.5554 (5) | 0.5165 | 0.5051 | 0.5069 | 0.4987 | 0.7311 (2) | 0.5218 | 0.4900 | 0.1700 (7) |
| | 0.6191 | 0.5396 (5) | 0.5328 | 0.5080 | 0.5097 (4) | 0.5656 (1) | 0.5117 (3) | 0.6954 (5) | 0.5255 (5) | 0.5077 (2) | 0.2683 (5) |
| | 0.6108 | 0.5108 | 0.5295 | 0.5264 (2) | 0.5089 | 0.5119 (5) | 0.5155 (2) | 0.6191 | 0.5313 (1) | 0.4957 (3) | 0.2533 (6) |
| | 0.6279 | 0.5127 | 0.5645 (4) | 0.5263 (3) | 0.5023 | 0.4855 | 0.4773 | 0.6699 | 0.5231 | 0.4769 | 0.0583 (9) |
| Foundation Model: Llama-3 |
| N/A | 0.5267 | 0.4904 | 0.6412 | 0.5577 | 0.5191 | 0.7690 | 0.5136 | 0.5454 | 0.5223 | 0.5342 | - |
| | 0.8606 (1) | 0.5693 (5) | 0.6758 (2) | 0.6116 (2) | 0.5268 (5) | 0.6818 (4) | 0.4911 | 0.9193 (2) | 0.5469 (4) | 0.5116 | 0.3400 (2) |
| | 0.6599 | 0.7120 (1) | 0.6104 (5) | 0.5279 | 0.5176 | 0.5235 | 0.4808 | 0.8033 | 0.5284 | 0.5008 | 0.1200 (9) |
| | 0.7504 (3) | 0.6331 (3) | 0.8295 (1) | 0.5829 | 0.5239 | 0.6703 | 0.5165 | 0.8953 (4) | 0.5250 | 0.4900 | 0.1917 (7) |
| | 0.6592 | 0.5249 | 0.5731 | 0.6799 (1) | 0.5334 (4) | 0.6550 | 0.5301 (3) | 0.8795 (5) | 0.6051 (3) | 0.5320 (4) | 0.2367 (5) |
| | 0.6262 | 0.5019 | 0.5385 | 0.5840 (5) | 0.6267 (1) | 0.7201 (2) | 0.5939 (1) | 0.6427 | 0.6410 (1) | 0.6057 (3) | 0.4033 (1) |
| | 0.5922 | 0.4912 | 0.5175 | 0.5068 | 0.5011 | 0.5354 | 0.5147 (4) | 0.5684 | 0.4896 | 0.5142 (5) | 0.0450 (10) |
| | 0.7517 (2) | 0.6823 (2) | 0.6675 (3) | 0.5670 | 0.5191 | 0.6311 | 0.5134 (5) | 0.8979 (3) | 0.5353 | 0.4860 | 0.1867 (8) |
| | 0.6655 (5) | 0.4844 | 0.5634 | 0.5694 | 0.5355 (3) | 0.7422 (1) | 0.4958 | 0.8547 | 0.5947 (3) | 0.6104 (2) | 0.2367 (5) |
| | 0.5750 | 0.5040 | 0.5322 | 0.5876 (3) | 0.5600 (2) | 0.6935 (3) | 0.5727 (2) | 0.6373 | 0.6267 (2) | 0.6149 (1) | 0.2833 (3) |
| | 0.7444 (4) | 0.5807 (4) | 0.6567 (4) | 0.5871 (4) | 0.5045 | 0.6736 (5) | 0.4818 | 0.9363 (1) | 0.5414 (5) | 0.5125 | 0.2400 (4) |
| Model | | | | | | | | | | |
| BERT | 0.5204 | 0.4963 | 0.4992 | 0.4958 | 0.5059 | 0.4934 | 0.4914 | 0.5002 | 0.5037 | 0.4955 |
| | 0.6607 | 0.6004 | 0.5711 | 0.5213 | 0.5092 | 0.5675 | 0.5218 | 0.6189 | 0.5091 | 0.4795 |
| | 0.8569 | 0.7004 | 0.8019 | 0.5648 | 0.5653 | 0.5106 | 0.5093 | 0.6817 | 0.5039 | 0.4869 |
| OPT | 0.5650 | 0.5338 | 0.5236 | 0.5042 | 0.4994 | 0.5174 | 0.5026 | 0.3825 | 0.5205 | 0.5026 |
| | 0.7002 | 0.5996 | 0.6165 | 0.5189 | 0.5181 | 0.6156 | 0.4853 | 0.7665 | 0.5446 | 0.5037 |
| | 0.8658 | 0.7259 | 0.8132 | 0.5374 | 0.6220 | 0.5813 | 0.5014 | 0.6959 | 0.5248 | 0.4951 |
| Llama-3 | 0.7690 | 0.4904 | 0.6412 | 0.5577 | 0.5191 | 0.5267 | 0.5136 | 0.5454 | 0.5223 | 0.5342 |
| | 0.7295 | 0.6732 | 0.6223 | 0.5864 | 0.5626 | 0.7203 | 0.5764 | 0.7828 | 0.6296 | 0.5806 |
| | 0.8524 | 0.7206 | 0.8235 | 0.6660 | 0.6281 | 0.6683 | 0.5846 | 0.7042 | 0.6139 | 0.5318 |
| | | | | | | | | | | |
| | 0.5922 | 0.7287 | 0.6722 | 0.5824 | 0.7496 (4) | 0.4912 | 0.6687 (5) | 0.5221 | 0.4863 | 0.5901 |
| | 0.6827 | 0.7517 (2) | 0.7338 | 0.6036 | 0.7355 | 0.6636 | 0.6823 (1) | 0.6186 | 0.5270 | 0.6482 |
| | 0.7099 | 0.7439 | 0.6655 | 0.6442 | 0.7502 (3) | 0.5530 | 0.6757 (3) | 0.4844 | 0.4992 | 0.5866 |
| | 0.7356 | 0.7450 (5) | 0.6923 | 0.5750 | 0.7444 | 0.5248 | 0.6759 (2) | 0.5223 | 0.5040 | 0.5785 |
| | 0.7367 | 0.7592 (1) | 0.7143 | 0.6096 | 0.7444 | 0.5440 | 0.6738 (4) | 0.5039 | 0.5149 | 0.5807 |
| | | | | | | | | | | |
| | 0.5175 | 0.6443 | 0.5753 | 0.5304 | 0.6495 | 0.5068 | 0.5659 | 0.5631 | 0.5700 | 0.5858 |
| | 0.6262 | 0.6675 | 0.6157 | 0.5508 | 0.6439 | 0.5632 | 0.5670 | 0.5810 | 0.6024 (1) | 0.5375 |
| | 0.6222 | 0.6693 (2) | 0.5634 | 0.5650 | 0.6688 (3) | 0.5690 | 0.5721 | 0.5694 | 0.5973 (2) | 0.5743 |
| | 0.6328 | 0.6662 (4) | 0.5784 | 0.5322 | 0.6462 | 0.5822 | 0.5947 (5) | 0.5950 (4) | 0.5876 | 0.5864 |
| | 0.6363 | 0.6779 (1) | 0.5875 | 0.5338 | 0.6567 (5) | 0.5540 | 0.5728 | 0.5748 | 0.5953 (3) | 0.5871 |
| Overall |
| | 0.5011 | 0.5225 | 0.5254 | 0.5517 | 0.4947 | 0.5218 | 0.6260 (5) | 0.5716 | 0.5442 | 0.6139 |
| | 0.5140 | 0.5191 | 0.5366 | 0.5551 (5) | 0.5023 | 0.6099 | 0.6375 (4) | 0.6171 | 0.5678 | 0.6135 |
| | 0.5237 | 0.5293 | 0.5355 | 0.5602 (2) | 0.5098 | 0.5956 | 0.6381 (3) | 0.5636 | 0.5732 | 0.6179 |
| | 0.5563 (4) | 0.5340 | 0.5387 | 0.5600 (3) | 0.5039 | 0.6063 | 0.6432 (1) | 0.5853 | 0.5518 | 0.6119 |
| | 0.5098 | 0.5263 | 0.5259 | 0.5654 (1) | 0.5045 | 0.5962 | 0.6420 (2) | 0.5813 | 0.5638 | 0.6147 |
| | | | | | | | | | | |
| | 0.5354 | 0.6216 | 0.7460 (3) | 0.6649 | 0.6702 | 0.5147 | 0.4937 | 0.5192 | 0.5660 (5) | 0.4927 |
| | 0.5890 | 0.6311 | 0.7486 (2) | 0.6929 | 0.5829 | 0.5295 | 0.5134 | 0.5186 | 0.5859 (1) | 0.5077 |
| | 0.7054 | 0.7012 | 0.7422 (5) | 0.7144 | 0.7110 | 0.5130 | 0.5138 | 0.4958 | 0.5725 (4) | 0.4825 |
| | 0.7061 | 0.6832 | 0.7458 (4) | 0.6935 | 0.6843 | 0.5597 | 0.5249 | 0.5062 | 0.5727 (3) | 0.5089 |
| | 0.6406 | 0.6688 | 0.7487 (1) | 0.6737 | 0.6736 | 0.5133 | 0.5218 | 0.4950 | 0.5754 (2) | 0.4818 |
| | | | | | | | | | | |
| | 0.5684 | 0.8902 | 0.8679 | 0.6110 | 0.9299 (3) | 0.4896 | 0.5255 | 0.5720 | 0.6274 (4) | 0.5340 |
| | 0.8452 | 0.8979 | 0.8781 | 0.7137 | 0.8827 | 0.5426 | 0.5353 | 0.5885 | 0.6380 (2) | 0.5211 |
| | 0.8826 | 0.9076 | 0.8547 | 0.7431 | 0.9297 (4) | 0.5559 | 0.5468 | 0.5947 | 0.6419 (1) | 0.5462 |
| | 0.7833 | 0.9043 | 0.8674 | 0.6373 | 0.9307 (2) | 0.5741 | 0.5564 | 0.6078 | 0.6267 (5) | 0.5506 |
| | 0.8633 | 0.9191 (5) | 0.8899 | 0.6743 | 0.9363 (1) | 0.5445 | 0.5290 | 0.5918 | 0.6294 (3) | 0.5414 |
| Overall |
| | 0.5142 | 0.4903 | 0.5947 | 0.6004 | 0.4990 | 0.5245 | 0.6043 | 0.6600 | 0.6139 | 0.6252 |
| | 0.5006 | 0.4860 | 0.5892 | 0.6156 (2) | 0.5040 | 0.6092 | 0.6127 | 0.6646 (2) | 0.6492 | 0.5997 |
| | 0.5399 | 0.5188 | 0.6104 (4) | 0.6204 (1) | 0.5189 | 0.6394 | 0.6376 | 0.6596 (4) | 0.6585 | 0.6377 |
| | 0.5622 | 0.5033 | 0.6006 | 0.6149 (3) | 0.5216 | 0.6371 | 0.6344 | 0.6656 (1) | 0.6290 | 0.6392 (5) |
| | 0.4921 | 0.4875 | 0.5903 | 0.6055 (5) | 0.5125 | 0.6108 | 0.6252 | 0.6631 (3) | 0.6317 | 0.6291 |
| First Step | Second Step | AUC | nDCG@1 | nDCG@5 | MRR | Recall@1 | Recall@5 |
| | | | | | | | |
| | | 0.6687 | 0.5179 | 0.5767 | 0.5687 | 0.1865 | 0.5808 |
| | | 0.6636 | 0.4905 | 0.5653 | 0.5513 | 0.1653 | 0.5643 |
| | | 0.5221 | 0.3234 | 0.4072 | 0.4333 | 0.1100 | 0.4249 |
| | | 0.5530 | 0.3461 | 0.4401 | 0.4607 | 0.1232 | 0.4711 |
| | | 0.4863 | 0.2675 | 0.3636 | 0.4009 | 0.0866 | 0.3944 |
| | | 0.5248 | 0.3154 | 0.4090 | 0.4330 | 0.1001 | 0.4365 |
| | | 0.5901 | 0.4206 | 0.4918 | 0.5019 | 0.1514 | 0.5071 |
| | | 0.5440 | 0.3506 | 0.4377 | 0.4562 | 0.1170 | 0.4644 |
| | | 0.5270 | 0.2999 | 0.3987 | 0.4309 | 0.1042 | 0.4360 |
| | | 0.6759 | 0.5342 | 0.5811 | 0.5752 | 0.1903 | 0.5777 |
| | | 0.6186 | 0.4205 | 0.5067 | 0.5110 | 0.1451 | 0.5228 |
| | | 0.6757 | 0.5208 | 0.5813 | 0.5743 | 0.1847 | 0.5886 |
| | | 0.6482 | 0.4630 | 0.5440 | 0.5319 | 0.1539 | 0.5597 |
| | | 0.6738 | 0.5182 | 0.5777 | 0.5719 | 0.1813 | 0.5814 |
| | | 0.5223 | 0.2990 | 0.4026 | 0.4289 | 0.0973 | 0.4326 |
| | | 0.4992 | 0.2688 | 0.3774 | 0.4150 | 0.0904 | 0.4128 |
| | | 0.5785 | 0.3932 | 0.4759 | 0.4878 | 0.1373 | 0.5025 |
| | | 0.5149 | 0.3005 | 0.3941 | 0.4294 | 0.1057 | 0.4276 |
| | | 0.5866 | 0.4026 | 0.4821 | 0.4958 | 0.1428 | 0.5032 |
| | | 0.5039 | 0.3012 | 0.3857 | 0.4249 | 0.1013 | 0.4178 |
| | | | | | | | |
| | | 0.6443 | 0.6782 | 0.6782 | 0.7648 | 0.3301 | 1.0000 |
| | | 0.6262 | 0.6638 | 0.8450 | 0.7433 | 0.3107 | 1.0000 |
| | | 0.5753 | 0.5945 | 0.8201 | 0.7204 | 0.2873 | 1.0000 |
| | | 0.6222 | 0.6665 | 0.8441 | 0.7554 | 0.3292 | 1.0000 |
| | | 0.5304 | 0.5257 | 0.7972 | 0.6889 | 0.2511 | 1.0000 |
| | | 0.6328 | 0.6635 | 0.8463 | 0.7607 | 0.3276 | 1.0000 |
| | | 0.6495 | 0.6867 | 0.6867 | 0.7749 | 0.3423 | 1.0000 |
| | | 0.6363 | 0.6705 | 0.8484 | 0.7622 | 0.3307 | 1.0000 |
| | | 0.5508 | 0.5653 | 0.8096 | 0.7094 | 0.2780 | 1.0000 |
| | | 0.6662 | 0.7094 | 0.8628 | 0.7828 | 0.3496 | 1.0000 |
| | | 0.6157 | 0.6423 | 0.8387 | 0.7473 | 0.3137 | 1.0000 |
| | | 0.6693 | 0.7028 | 0.8624 | 0.7850 | 0.3483 | 1.0000 |
| | | 0.6439 | 0.6742 | 0.8511 | 0.7619 | 0.3258 | 1.0000 |
| | | 0.6779 | 0.7162 | 0.8671 | 0.7921 | 0.3550 | 1.0000 |
| | | 0.5784 | 0.5939 | 0.8211 | 0.7248 | 0.2913 | 1.0000 |
| | | 0.5650 | 0.5841 | 0.8161 | 0.7192 | 0.2868 | 1.0000 |
| | | 0.6462 | 0.6879 | 0.8540 | 0.7722 | 0.3413 | 1.0000 |
| | | 0.5338 | 0.5435 | 0.8011 | 0.6999 | 0.2687 | 1.0000 |
| | | 0.6688 | 0.7104 | 0.8636 | 0.7873 | 0.3534 | 1.0000 |
| | | 0.5875 | 0.6080 | 0.8251 | 0.7334 | 0.3019 | 1.0000 |
| First Step | Second Step | AUC | nDCG@1 | nDCG@5 | MRR | Recall@1 | Recall@5 |
| | | | | | | | |
| | | 0.6216 | 0.4486 | 0.5870 | 0.5467 | 0.2146 | 0.7081 |
| | | 0.5890 | 0.3970 | 0.5600 | 0.5149 | 0.1873 | 0.6875 |
| | | 0.7460 | 0.5643 | 0.7022 | 0.6516 | 0.2757 | 0.8207 |
| | | 0.7054 | 0.5360 | 0.6671 | 0.6224 | 0.2648 | 0.7896 |
| | | 0.6649 | 0.4753 | 0.6223 | 0.5757 | 0.2297 | 0.7523 |
| | | 0.7061 | 0.5277 | 0.6685 | 0.6181 | 0.2592 | 0.7952 |
| | | 0.6702 | 0.5270 | 0.6434 | 0.6072 | 0.2610 | 0.7577 |
| | | 0.6406 | 0.4749 | 0.6085 | 0.5677 | 0.2337 | 0.7323 |
| | | 0.6929 | 0.5158 | 0.6487 | 0.6072 | 0.2552 | 0.7728 |
| | | 0.6832 | 0.5330 | 0.6531 | 0.6079 | 0.2594 | 0.7668 |
| | | 0.7486 | 0.5926 | 0.7075 | 0.6648 | 0.2934 | 0.8187 |
| | | 0.7012 | 0.5546 | 0.6680 | 0.6262 | 0.2732 | 0.7790 |
| | | 0.5829 | 0.3782 | 0.5432 | 0.4812 | 0.1570 | 0.6479 |
| | | 0.6688 | 0.5217 | 0.6417 | 0.6024 | 0.2559 | 0.7530 |
| | | 0.7458 | 0.5815 | 0.7041 | 0.6576 | 0.2861 | 0.8183 |
| | | 0.7144 | 0.5544 | 0.6732 | 0.6279 | 0.2723 | 0.7901 |
| | | 0.6843 | 0.5417 | 0.6553 | 0.6196 | 0.2690 | 0.7688 |
| | | 0.6737 | 0.4782 | 0.6307 | 0.5871 | 0.2357 | 0.7659 |
| | | 0.7110 | 0.5660 | 0.6785 | 0.6417 | 0.2823 | 0.7916 |
| | | 0.7487 | 0.5902 | 0.7092 | 0.6657 | 0.2934 | 0.8239 |
| | | | | | | | |
| | | 0.4937 | 0.4761 | 0.7172 | 0.6378 | 0.2227 | 0.8871 |
| | | 0.5295 | 0.5086 | 0.7383 | 0.6575 | 0.2385 | 0.9011 |
| | | 0.5192 | 0.5164 | 0.7356 | 0.6576 | 0.2461 | 0.8985 |
| | | 0.5130 | 0.4964 | 0.7289 | 0.6572 | 0.2445 | 0.8998 |
| | | 0.5660 | 0.5535 | 0.7599 | 0.6842 | 0.2684 | 0.9166 |
| | | 0.5597 | 0.5421 | 0.7553 | 0.6813 | 0.2645 | 0.9152 |
| | | 0.4927 | 0.4765 | 0.7168 | 0.6465 | 0.2360 | 0.8917 |
| | | 0.5133 | 0.5043 | 0.7281 | 0.6559 | 0.2451 | 0.8938 |
| | | 0.5859 | 0.5732 | 0.7695 | 0.6981 | 0.2815 | 0.9208 |
| | | 0.5249 | 0.5097 | 0.7358 | 0.6605 | 0.2461 | 0.9022 |
| | | 0.5186 | 0.5078 | 0.7340 | 0.6587 | 0.2476 | 0.9009 |
| | | 0.5138 | 0.5029 | 0.7303 | 0.6552 | 0.2437 | 0.9002 |
| | | 0.5077 | 0.4995 | 0.7262 | 0.6482 | 0.2365 | 0.8930 |
| | | 0.5218 | 0.5019 | 0.7321 | 0.6595 | 0.2441 | 0.9003 |
| | | 0.5062 | 0.5105 | 0.7294 | 0.6543 | 0.2467 | 0.8953 |
| | | 0.5725 | 0.5635 | 0.7640 | 0.6909 | 0.2752 | 0.9175 |
| | | 0.5089 | 0.4963 | 0.7266 | 0.6568 | 0.2453 | 0.8972 |
| | | 0.5754 | 0.5628 | 0.7657 | 0.6939 | 0.2770 | 0.9220 |
| | | 0.4825 | 0.4684 | 0.7091 | 0.6412 | 0.2320 | 0.8850 |
| | | 0.4950 | 0.4885 | 0.7211 | 0.6491 | 0.2412 | 0.8955 |
| First Step | Second Step | AUC | nDCG@1 | nDCG@5 | MRR | Recall@1 | Recall@5 |
| | | | | | | | |
| | | 0.5659 | 0.2801 | 0.4068 | 0.4011 | 0.1319 | 0.5098 |
| | | 0.5632 | 0.2638 | 0.4017 | 0.3902 | 0.1224 | 0.5006 |
| | | 0.5631 | 0.2773 | 0.4025 | 0.3992 | 0.1322 | 0.5060 |
| | | 0.5690 | 0.2705 | 0.4091 | 0.4026 | 0.1316 | 0.5236 |
| | | 0.5700 | 0.2913 | 0.4137 | 0.4081 | 0.1391 | 0.5173 |
| | | 0.5822 | 0.3072 | 0.4315 | 0.4214 | 0.1483 | 0.5418 |
| | | 0.5858 | 0.2866 | 0.4311 | 0.4247 | 0.1420 | 0.5541 |
| | | 0.5540 | 0.2752 | 0.3944 | 0.3972 | 0.1339 | 0.4980 |
| | | 0.6024 | 0.3263 | 0.4463 | 0.4380 | 0.1604 | 0.5559 |
| | | 0.5947 | 0.3389 | 0.4488 | 0.4387 | 0.1619 | 0.5490 |
| | | 0.5810 | 0.3013 | 0.4289 | 0.4209 | 0.1475 | 0.5383 |
| | | 0.5721 | 0.2969 | 0.4172 | 0.4155 | 0.1434 | 0.5228 |
| | | 0.5375 | 0.2373 | 0.3741 | 0.3605 | 0.0994 | 0.4657 |
| | | 0.5728 | 0.2925 | 0.4182 | 0.4143 | 0.1408 | 0.5236 |
| | | 0.5950 | 0.3267 | 0.4455 | 0.4354 | 0.1596 | 0.5504 |
| | | 0.5973 | 0.3559 | 0.4529 | 0.4440 | 0.1743 | 0.5487 |
| | | 0.5864 | 0.3162 | 0.4433 | 0.4342 | 0.1567 | 0.5616 |
| | | 0.5953 | 0.3181 | 0.4452 | 0.4346 | 0.1552 | 0.5588 |
| | | 0.5743 | 0.2949 | 0.4227 | 0.4212 | 0.1463 | 0.5397 |
| | | 0.5748 | 0.3099 | 0.4234 | 0.4224 | 0.1526 | 0.5294 |
| | | | | | | | |
| | | 0.5225 | 0.5763 | 0.7969 | 0.7180 | 0.2760 | 0.9523 |
| | | 0.5140 | 0.5798 | 0.7964 | 0.7105 | 0.2706 | 0.9515 |
| | | 0.5254 | 0.5911 | 0.8011 | 0.7250 | 0.2865 | 0.9537 |
| | | 0.5237 | 0.5805 | 0.7978 | 0.7238 | 0.2861 | 0.9544 |
| | | 0.5517 | 0.6209 | 0.8137 | 0.7394 | 0.3027 | 0.9584 |
| | | 0.5563 | 0.6182 | 0.8126 | 0.7418 | 0.3030 | 0.9557 |
| | | 0.4947 | 0.5583 | 0.7864 | 0.7129 | 0.2763 | 0.9495 |
| | | 0.5098 | 0.5785 | 0.7941 | 0.7184 | 0.2840 | 0.9515 |
| | | 0.5551 | 0.6242 | 0.8156 | 0.7443 | 0.3078 | 0.9591 |
| | | 0.5340 | 0.5878 | 0.8029 | 0.7276 | 0.2854 | 0.9553 |
| | | 0.5366 | 0.5969 | 0.8038 | 0.7326 | 0.2941 | 0.9534 |
| | | 0.5293 | 0.5829 | 0.8001 | 0.7259 | 0.2845 | 0.9548 |
| | | 0.5023 | 0.5619 | 0.7910 | 0.7171 | 0.2756 | 0.9511 |
| | | 0.5263 | 0.5866 | 0.7974 | 0.7258 | 0.2879 | 0.9497 |
| | | 0.5387 | 0.6029 | 0.8064 | 0.7337 | 0.2948 | 0.9555 |
| | | 0.5602 | 0.6296 | 0.8170 | 0.7453 | 0.3100 | 0.9590 |
| | | 0.5039 | 0.5656 | 0.7886 | 0.7174 | 0.2809 | 0.9477 |
| | | 0.5654 | 0.6278 | 0.8190 | 0.7492 | 0.3114 | 0.9625 |
| | | 0.5098 | 0.5785 | 0.7941 | 0.7184 | 0.2840 | 0.9515 |
| | | 0.5259 | 0.5927 | 0.8001 | 0.7298 | 0.2942 | 0.9522 |
Evaluating Recabilities of Foundation Models:
A Multi-Domain, Multi-Dataset Benchmark
Qijiong Liu (1), Jieming Zhu (2), Yingxin Lai (3), Xiaoyu Dong (1),
Lu Fan (1), Zhipeng Bian (4), Zhenhua Dong (2), Xiao-Ming Wu (1)
(1) The Hong Kong Polytechnic University (2) Huawei Noah’s Ark Lab, Shenzhen, China
(3) Xiamen University, Xiamen, China (4) Shenzhen University, Shenzhen, China
Abstract
Comprehensive evaluation of the recommendation capabilities of existing foundation models across diverse datasets and domains is essential for advancing the development of recommendation foundation models. In this study, we introduce RecBench-MD, a novel and comprehensive benchmark designed to assess the recommendation abilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective. Through extensive evaluations of 19 foundation models across 15 datasets spanning 10 diverse domains, including e-commerce, entertainment, and social media, we identify key characteristics of these models in recommendation tasks. Our findings suggest that in-domain fine-tuning achieves optimal performance, while cross-dataset transfer learning provides effective practical support for new recommendation scenarios. Additionally, we observe that multi-domain training significantly enhances the adaptability of foundation models. All code (https://github.com/Jyonn/RecBench-MD) and data (https://www.kaggle.com/datasets/qijiong/recbench-md) have been publicly released to facilitate future research.
1 Introduction
The rapid emergence of foundation models, particularly large language models (LLMs), has revolutionized various fields such as natural language processing (NLP) (Touvron et al., 2023a; Reid et al., 2024) and computer vision (Kirillov et al., 2023; Li et al., 2024). Recently, their application in recommender systems has attracted considerable interest, as these models promise a unified framework capable of modeling user–item interactions through natural language (Wu et al., 2024a; Zhao et al., 2024; Bao et al., 2023b). Despite the existence of numerous foundation models, most are primarily designed for NLP tasks, and there is currently a lack of effective strategies for selecting appropriate models to develop recommendation foundation models. Consequently, assessing the recommendation abilities, referred to as Recabilities, of foundation models has become increasingly important.
Recommendation foundation models, akin to LLMs with general NLP capabilities, should exhibit broad zero-resource Recabilities, allowing for inference on unseen datasets or even novel domains. (In this paper, zero-resource means fine-tuning on some datasets and testing on unseen ones, i.e., cross-dataset, while zero-shot means testing without any fine-tuning.) This necessitates a comprehensive evaluation of the recommendation abilities of existing foundation models across various datasets, domains, training and evaluation strategies, and recommendation tasks and approaches. Although efforts such as LLMRec (Liu et al., 2023) and PromptRec (Wu et al., 2024b) exist, these studies primarily concentrate on a single domain or dataset using one recommendation approach, leading to a constrained evaluation scope and partial conclusions.
To address these challenges, we introduce a multi-domain recommendation taxonomy that examines all applicable scenarios across eight settings, as depicted in Figure 1. Initially, recommendation models were developed for individual datasets, corresponding to the single-domain single-dataset setting. Subsequently, researchers explored cross-domain recommendation models, represented by the cross-domain cross-dataset setting, which aim to transfer user interest knowledge from a source domain to a target domain. Additionally, some studies have investigated the integration of multiple domains to train a unified model, corresponding to the multi-domain multi-dataset setting, transitioning from a one-dataset-one-model paradigm to a multiple-dataset-one-model paradigm. More recently, several studies have assessed the zero-shot Recabilities of foundation models. However, these evaluations are often limited to a single dataset (the zero-shot single-dataset setting), which can lead to unreliable and biased results. A more robust approach is to evaluate across multiple datasets and compute the average, as in the zero-shot multi-domain setting.
In this work, we present a comprehensive benchmark, RecBench-MD, specifically designed to evaluate the Recabilities of foundation models from a zero-resource, multi-dataset, and multi-domain perspective, encompassing all settings illustrated in Figure 1. This study pioneers the benchmarking of cross-dataset recommendation in zero-resource settings. We examine two recommendation approaches, prompt-based ranking tasks and embedding-based matching tasks, thereby covering the main recommendation scenarios. Our evaluation encompasses 15 recommendation datasets across 10 domains and 19 foundation models. Furthermore, we provide open-source code and datasets, so that future large-scale recommendation or foundation models can be evaluated with simple configuration. Our experiments required 1,000 GPU hours; the platform’s reusability significantly reduces experimental costs for future researchers, allowing them to concentrate on model optimization and algorithm innovation.
Our benchmarking results reveal several key insights. First, larger models tend to benefit more from joint training on multiple datasets or domains, exhibiting stronger cross-domain generalization. Second, the degree of transferability across domains varies considerably, with a strong dependence on the characteristics of the source dataset. Third, while in-domain datasets exhibit higher relevance, this is not universally observed across all scenarios. Fourth, cross-dataset transfer can serve as an effective model warm-up strategy in novel recommendation contexts, though it is challenging to exceed the performance upper bound established by fine-tuning on single or multiple datasets within the target domain.
2 Related Work
Existing Benchmarks. Several benchmarks have been proposed to evaluate the Recabilities of foundation models, including LLMRec (Liu et al., 2023), PromptRec (Wu et al., 2024b), and others (Zhang et al., 2021; Jiang et al., 2024; Liu et al., 2024a). However, as illustrated in Table 1, these benchmarks (i) provide only a limited evaluation of recommendation settings, often focusing on a single approach, and (ii) evaluate a relatively small number of foundation models and datasets, resulting in an incomplete and fragmented performance landscape in this domain.
Multi-domain Recommendation. Traditional multi-domain recommendation methods predominantly rely on item-based or user-based knowledge transfer, using common items or shared user interactions to mitigate data sparsity and domain discrepancies (Guo et al., 2021; Chen et al., 2022, 2019). However, such approaches require explicit entity-level overlap between domains, a condition rarely met in real-world scenarios (Zhu et al., 2021; Zang et al., 2022). In contrast, text-based knowledge transfer leverages rich entity-side information, such as item descriptions and user profiles, encoded as diverse features (Chen et al., 2013; Gao et al., 2013). By borrowing the semantic comprehension and generation capabilities of foundation models, text-based methods boost cross-domain learning without the need for explicit entity alignment, i.e., even when users and items do not overlap, thereby offering a more flexible and robust framework for transferring knowledge across heterogeneous domains.
Foundation Models for Recommendation. In recent years, integrating large language models (LLMs) into recommender systems has attracted significant academic and industrial interest. These integrations can be broadly classified into two paradigms (Wu et al., 2024a; Zhao et al., 2024; Bao et al., 2023b; Chen et al., 2024): LLM-for-RS and LLM-as-RS. The LLM-for-RS paradigm enhances traditional recommenders via feature engineering or encoding techniques using LLMs (Wei et al., 2024; Liu et al., 2024b, c, 2025a; Wu et al., 2023; Zhou et al., 2025; Hu et al., 2024). In contrast, the LLM-as-RS paradigm employs LLMs directly as recommenders (Ngo and Nguyen, 2024; Li et al., 2023; Geng et al., 2022; Liu et al., 2024d). Studies have demonstrated its superior accuracy in contexts such as cold-start scenarios (Bao et al., 2023a) and in tasks requiring natural language understanding and generation (Luo et al., 2023; Wang et al., 2023; He et al., 2023).
3 Proposed Benchmark: RecBench-MD
3.1 Recommendation Settings
In bottom-level text-based knowledge transfer, we can freely collect training data as long as: (i) each item is described by textual content, and (ii) each user is represented by their item consumption sequence. To systematically explore how cross-domain data influences target-domain recommendation performance, we propose a novel taxonomy comprising eight fine-tuning settings, as illustrated in Figure 1:
**(Zero-resource) Zero-shot Single-dataset.** The model is directly evaluated on a single dataset without any fine-tuning. This setting measures the model’s intrinsic Recability.
**(Zero-resource) Zero-shot Multi-domain.** A more comprehensive zero-shot evaluation: the model is tested on multiple datasets from different domains, and performance is averaged to assess generalization.
**Single-domain Single-dataset.** Fine-tuning and evaluation are performed on the same dataset. This setting reflects standard in-domain supervised learning.
**(Zero-resource) Single-domain Cross-dataset.** The model is fine-tuned on one or more datasets within a domain and evaluated on a different dataset from the same domain. It assesses domain-level generalization across datasets.
**Single-domain Multi-dataset.** Training and testing data are drawn from multiple datasets within the same domain, with potential overlap. This setting measures the benefit of aggregating in-domain data.
**(Zero-resource) Cross-domain Cross-dataset.** The model is fine-tuned on one domain and tested on a completely different one. This setting probes cross-domain transferability of recommendation knowledge.
**(Zero-resource) Multi-domain Cross-dataset.** Training and testing datasets come from overlapping but non-identical domains. This setting evaluates how auxiliary domain knowledge contributes to target performance when datasets do not overlap.
**Multi-domain Multi-dataset.** Both domains and datasets overlap between training and testing. This setting examines the upper-bound performance achievable via comprehensive domain and dataset fusion.
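One possible reading of the taxonomy above can be sketched as a small decision function. The function name `classify_setting`, the `DOMAIN_OF` mapping, and the exact decision rules are illustrative assumptions rather than the benchmark's own code; the dataset and domain names are taken from the dataset table. The two zero-shot settings differ only in how many test datasets are averaged, so the sketch returns a single "Zero-shot" label for both.

```python
from typing import Dict, Set

# Hypothetical domain lookup for a few datasets listed in the dataset table.
DOMAIN_OF: Dict[str, str] = {
    "H&M": "Fashion", "POG": "Fashion",
    "MIND": "News", "PENS": "News",
    "Goodreads": "Book", "Amazon Books": "Book",
}

def classify_setting(finetune: Set[str], test: str,
                     domain_of: Dict[str, str]) -> str:
    """Name the setting for one test dataset (an assumed reading of the taxonomy)."""
    if not finetune:
        # No fine-tuning at all: zero-shot evaluation.
        return "Zero-shot"
    target = domain_of[test]
    train_domains = {domain_of[d] for d in finetune}
    seen = test in finetune  # is the test dataset part of the fine-tuning data?
    if train_domains == {target}:
        if finetune == {test}:
            return "Single-domain Single-dataset"
        return "Single-domain Multi-dataset" if seen else "Single-domain Cross-dataset"
    if target not in train_domains:
        # Fine-tuning data comes only from other domains.
        return "Cross-domain Cross-dataset"
    # The target domain is mixed with auxiliary domains.
    return "Multi-domain Multi-dataset" if seen else "Multi-domain Cross-dataset"

print(classify_setting({"POG"}, "H&M", DOMAIN_OF))
print(classify_setting({"POG", "MIND"}, "H&M", DOMAIN_OF))
```

Under these assumed rules, fine-tuning on POG and testing on H&M counts as single-domain cross-dataset, while adding MIND to the fine-tuning data turns it into multi-domain cross-dataset.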
3.2 Recommendation Approaches
We evaluate the Recabilities of foundation models with the pair-wise user–item click prediction task, which estimates the probability that a user will interact positively with a candidate item. The models are therefore trained with the binary cross-entropy (BCE) loss, formulated as:
$$\mathcal{L} = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right],$$
where $y$ denotes the ground-truth label and $\hat{y}$ the predicted click probability. Borrowing the idea from conventional recommendation, including matching-based and ranking-based models, we devise two recommendation approaches to calculate the click probabilities.
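As a small numerical illustration of the binary cross-entropy objective mentioned above, the following sketch evaluates the standard textbook form of the loss for a single user–item pair. The function name `bce_loss` and the clipping constant `eps` are assumptions added for numerical safety, not details taken from the paper.

```python
import math

def bce_loss(y_true: float, y_pred: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one (label, predicted probability) pair."""
    # Clip the prediction away from 0 and 1 so that log() stays finite.
    p = min(max(y_pred, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1.0 - y_true) * math.log(1.0 - p))

# A confident correct prediction is penalised less than a confident wrong one.
print(bce_loss(1.0, 0.9), bce_loss(1.0, 0.1))
```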
Prompt-based Recommendation. We concatenate the user sequence with the candidate item, where each item in the sequence and the candidate item are represented by their textual features. The entire user–item sequence is then combined with a task-specific instruction (e.g., "Will the user be interested in this item? Answer (Yes or No):"). Next, the model is guided to predict specific output tokens (i.e., “Yes” or “No”), and we take their corresponding logits, $l_{\text{yes}}$ and $l_{\text{no}}$. Finally, the click probability can be denoted as:
$$\hat{y} = \frac{e^{l_{\text{yes}}}}{e^{l_{\text{yes}}} + e^{l_{\text{no}}}} \tag{2}$$
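The prompt-based scoring step can be sketched as follows, assuming the click probability is a two-way softmax over the "Yes" and "No" logits. The prompt layout in `build_prompt` is a hypothetical template, and both function names are illustrative. Only the instruction string is taken from the text above.

```python
import math

INSTRUCTION = "Will the user be interested in this item? Answer (Yes or No):"

def build_prompt(history_titles, candidate_title):
    """Join the textual user history, the candidate item, and the instruction.

    The field names and separators are a hypothetical layout.
    """
    history = "; ".join(history_titles)
    return (
        f"User history: {history}\n"
        f"Candidate item: {candidate_title}\n"
        f"{INSTRUCTION}"
    )

def click_probability(logit_yes, logit_no):
    """Two-way softmax over the "Yes" and "No" logits."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)
```

Restricting the softmax to the two answer tokens ignores the probability mass the model assigns to the rest of the vocabulary, so the score always lies in (0, 1) and can be fed directly to the BCE loss.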
Embedding-based Recommendation. Following the matching-based two-tower paradigm, the foundation models are employed here as user and item encoders to learn dense representations (embeddings) within a shared latent space. Specifically, we use the output embedding of the last token as the user or item representation when the input is the user sequence or the candidate item, respectively. The click probability can subsequently be measured by the cosine similarity:
$$\hat{y} = \cos(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert} \tag{3}$$
where $\cdot$ denotes the dot product operation, $\lVert \cdot \rVert$ represents the L2 norm, and $\mathbf{u}$ and $\mathbf{v}$ are the user and item representations.
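For the embedding-based approach, a minimal sketch of the scoring step is given below. It takes two already computed embedding vectors as plain Python lists. The helper name `cosine_similarity` is illustrative, and the encoder that produces the last-token embeddings is left out.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between a user embedding u and an item embedding v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Note that the raw cosine score lies in [-1, 1]. How it is rescaled before entering a probability-based loss is not specified in this section, so the sketch returns the similarity unchanged.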
4 Experimental Setup
Datasets. To probe foundation model capabilities in recommendation beyond the prevalent single-dataset evaluations, we assembled a deliberately heterogeneous suite of 15 public datasets across 10 domains. The scale and diversity of this collection (Table 2) are needed to stress-test the central premise that foundation models generalize across varied recommendation contexts, from high-volume consumer arenas (e.g., fashion, news) to specialized niches (e.g., games, hotels). The datasets differ in item taxonomies, user interaction dynamics, richness of textual signals, and sparsity levels; we rely on this heterogeneity to move beyond potentially idiosyncratic single-domain observations and evaluate genuine cross-domain generalization. For each dataset, the fine-tuning set is randomly split into a training set and a validation set in a 9:1 ratio.
Foundation Models. We collected 19 foundation models from different perspectives to evaluate their Recabilities, including: BERT (Kenton and Toutanova, 2019), OPT in two sizes (Zhang et al., 2022), Llama-1 (Touvron et al., 2023a), Llama-2 (Touvron et al., 2023b), Llama-3 (Dubey et al., 2024), Llama-3.1 (Meta AI, 2024), GPT-3.5 (OpenAI, 2023), Qwen-2 in three sizes (Yang et al., 2024), GLM-4 (GLM et al., 2024), Mistral-2 (Jiang et al., 2023), DS-Qwen-2 (Bi et al., 2024), E5 (Wang et al., 2022), Phi-2 (Javaheripi et al., 2023), RecGPT (Ngo and Nguyen, 2024), P5 (Geng et al., 2022), and Recformer (Li et al., 2023). We present a comprehensive comparison across multiple dimensions: varying model sizes within the same organization (e.g., the Qwen-2 series), different versions from the same organization (e.g., the Llama series), models of similar size released in the same year by different organizations (e.g., THU’s GLM-4, Meta’s Llama-3, and Alibaba’s Qwen-2 in 2024), and models targeting different domains (e.g., the general foundation model Llama vs. the recommendation foundation model RecGPT). The closed-source GPT-3.5 model from OpenAI supports only the prompt-based recommendation paradigm, because its item and user embeddings are unavailable. In contrast, models such as Recformer and E5, which are designed with a dual-tower architecture, can only be evaluated with the embedding-based paradigm.
Evaluation Protocols. Following common practice (Liu et al., 2025b), we evaluate recommendation performance using widely adopted metrics, including ranking metrics such as GAUC, nDCG, and MRR, as well as matching metrics such as F1 and Recall. Due to space limitations, we mostly report only the GAUC metric (AUC for short). The full evaluation results will be available on our webpage.
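GAUC is not defined in this section. The sketch below follows a common definition, assumed here, in which AUC is computed per user and averaged with each user's sample count as the weight, skipping users whose samples are all positive or all negative. The function names are illustrative.

```python
from collections import defaultdict

def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined without both classes
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def gauc(user_ids, labels, scores):
    """Sample-count-weighted average of per-user AUC (assumed GAUC definition)."""
    groups = defaultdict(lambda: ([], []))
    for u, y, s in zip(user_ids, labels, scores):
        groups[u][0].append(y)
        groups[u][1].append(s)
    weighted, total = 0.0, 0
    for ys, ss in groups.values():
        user_auc = auc(ys, ss)
        if user_auc is None:
            continue  # skip users with a single class
        weighted += user_auc * len(ys)
        total += len(ys)
    return weighted / total if total else float("nan")
```

Grouping by user removes the effect of users whose scores are uniformly high or low, so the metric reflects how well candidates are ordered within each user's own list.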
Additionally, we design the Reciprocal Rank Average (RRA) metric to evaluate the contribution of each finetune set (used in Table 4). Specifically, we mark the top-$K$ finetune sets for each test set and calculate the top-$K$ RRA metric by:
$$\mathrm{RRA@}K = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathbb{1}(r_i \le K)}{r_i} \tag{4}$$
where $N$ is the number of the test datasets, $r_i$ is the rank of the model on the $i$-th dataset, and $K$ is the rank threshold (e.g., $K = 5$).
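One plausible reading of the top-K RRA metric is sketched below: for a given finetune set, the reciprocal of its rank is accumulated over the test sets on which it ranks within the top K, and the sum is divided by the number of test sets. This exact form, the function name `rra_at_k`, and the default `k=5` are assumptions for illustration.

```python
def rra_at_k(ranks, k=5):
    """Top-K Reciprocal Rank Average for one finetune set.

    ranks: the rank of the model fine-tuned on this set for each test set
           (1 = best). Ranks beyond k contribute nothing (assumed form).
    """
    if not ranks:
        raise ValueError("ranks must contain at least one test set")
    return sum(1.0 / r for r in ranks if r <= k) / len(ranks)
```

For example, a finetune set ranked 1st, 2nd, 6th, and 10th on four test sets would score (1 + 0.5) / 4 = 0.375 with K = 5 under this reading.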
Implementation Details. During data preprocessing, we standardized datasets of varying original sizes to comparable scales: the test set contains approximately 20,000 samples, while the fine-tuning set consists of around 100,000 samples. For each dataset, items were carefully curated to retain the most representative textual content features. User behavior sequences were truncated to a maximum length of 20; if a sequence exceeded this limit, only the most recent interactions were preserved.
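The two preprocessing rules stated in the text, truncation to the 20 most recent interactions and the random 9:1 train/validation split, can be sketched as below. The sketch assumes each history is ordered from oldest to newest. The function names and the fixed seed are illustrative.

```python
import random

MAX_HISTORY = 20  # maximum user-sequence length stated in the text

def truncate_history(item_ids, max_len=MAX_HISTORY):
    """Keep only the most recent interactions (assumes oldest-first order)."""
    return item_ids[-max_len:]

def split_train_valid(samples, valid_ratio=0.1, seed=0):
    """Randomly split a fine-tuning set into training and validation parts (9:1)."""
    rng = random.Random(seed)  # local RNG so the global state is untouched
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_ratio)
    return shuffled[n_valid:], shuffled[:n_valid]
```

With a fine-tuning set of around 100,000 samples, this split would leave roughly 90,000 samples for training and 10,000 for validation.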
We fine-tune models using Low-Rank Adaptation (LoRA; Hu et al., 2022), a parameter-efficient strategy, with rank 32 and alpha 128. The learning rate is set to 1e-4 across all experiments, using the Adam optimizer. An effective batch size of 32 is maintained via gradient accumulation, and early stopping is applied with a patience of 2. Models are built and evaluated using the Hugging Face Transformers library (Wolf et al., 2019). For BERT, OPT, and Llama-3, the maximum sequence lengths are 512, 1024, and 1024, respectively, with precision set to float32 for BERT and bfloat16 for OPT and Llama-3. To reduce fine-tuning overhead for embedding-based architectures, we freeze the lower layers of OPT and Llama-3 and apply LoRA only to the top two layers. When fine-tuning on multiple datasets, early stopping is based on the average validation AUC across datasets. We will release the code, data, checkpoints, and documentation at our GitHub repository.
All experiments are conducted on a single NVIDIA A100 GPU. Except for the zero-shot setting, all results are averaged over five runs, with statistically significant differences observed.
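Two of the training controls described above can be sketched without any deep-learning dependency: the number of gradient accumulation steps needed to reach the effective batch size of 32, and early stopping with a patience of 2 on the average validation AUC across datasets. The class and function names, and the per-step batch size used in the example, are assumptions. The LoRA configuration itself is not shown.

```python
def accumulation_steps(effective_batch_size, per_step_batch_size):
    """Gradient accumulation steps needed to reach the effective batch size."""
    if effective_batch_size % per_step_batch_size != 0:
        raise ValueError("effective batch size must be a multiple of the step size")
    return effective_batch_size // per_step_batch_size

class EarlyStopper:
    """Stop once the average validation AUC fails to improve `patience` times in a row."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, val_aucs):
        """val_aucs: per-dataset validation AUCs for this epoch. Returns True to stop."""
        avg = sum(val_aucs) / len(val_aucs)
        if avg > self.best:
            self.best = avg
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

For instance, a per-step batch size of 8 (an illustrative value) would require 4 accumulation steps. Averaging the validation AUC keeps a single easy dataset from deciding when multi-dataset training stops.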
5 Findings and Discussions
In this section, we present a comprehensive analysis of experimental results evaluating the Recabilities of foundation models under diverse fine-tuning regimes and evaluation tasks. (Due to space limits, more experimental results are provided in the appendix and supplementary material.)
5.1 Zero-shot Multi-domain: Prompt-based vs. Embedding-based
Here, we investigate the zero-shot Recabilities of various foundation models. For each dataset, we identify the maximum and minimum AUC values across all evaluated models in both paradigms (with the minimum constrained to 0.5) and normalize the results accordingly, as shown in Figure 2. Based on these findings, we make the following observations:
First, for almost all datasets, the prompt-based evaluation paradigm outperforms the embedding-based one, as it aligns more closely with the pre-training objectives of foundation models.
Second, under the prompt-based paradigm, three LLMs (Mistral-2, GLM-4, Qwen-2) exhibit superior performance, possibly due to the inclusion of collaborative signals during pre-training. In contrast, P5 performs well on two Amazon datasets (CDs and Electronics) but less favorably on others, likely because the checkpoint used was trained on the Amazon Beauty dataset and thus models the interests of Amazon users.
Third, in the embedding-based paradigm, performance differences among models are less pronounced. Notably, smaller models (such as BERT and OPT) perform better under this setting than in the prompt-based paradigm, whereas the embeddings of larger models appear less sensitive to similarity metrics, in line with the findings of Freestone and Santu (2024). Additionally, the matching-based language model E5 and the recommendation model Recformer also demonstrate strong performance, benefiting from the consistency between the evaluation and training paradigms.
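The per-dataset normalization used for Figure 2 could be sketched as follows. The sketch adopts one reading of "the minimum constrained to 0.5": the lower bound is the observed minimum AUC but never below 0.5, and scores under that bound are clipped to 0. This reading, the function name, and the model names in the usage note are assumptions.

```python
def normalize_auc(aucs, floor=0.5):
    """Min-max normalize one dataset's AUCs across models.

    aucs: dict mapping model name -> AUC on a single dataset.
    The lower bound is max(observed minimum, floor), an assumed reading.
    """
    lo = max(min(aucs.values()), floor)
    hi = max(aucs.values())
    if hi <= lo:
        return {name: 0.0 for name in aucs}  # no model beats the lower bound
    return {
        name: min(max((a - lo) / (hi - lo), 0.0), 1.0)
        for name, a in aucs.items()
    }
```

With hypothetical AUCs of 0.45, 0.60, and 0.70 for three models, the normalized scores would be 0, about 0.5, and 1, so a model at or below chance level contributes nothing to the comparison.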
5.2 Single-domain Fine-tuning: Single-dataset vs. Cross-dataset
Here, we study the single-domain fine-tuning recommendation scenario. We mainly select three foundation models for the evaluation, i.e., BERT, OPT, and Llama-3, which span three distinct model sizes. From the results in Table 3, we make the following observations:
First, compared to the zero-shot baselines, both domain-specific fine-tuning strategies consistently achieve superior performance under the prompt-based and embedding-based paradigms. This is primarily because the models acquire domain-specific collaborative knowledge through fine-tuning.
Second, fine-tuning in the single-domain single-dataset setting yields more stable performance than the cross-dataset variant, even within the same domain. This is likely due to optimization conflicts between datasets, as observed on Goodreads and H&M, where the cross-dataset variant underperforms the single-dataset setting.
Third, large-scale foundation models (e.g., Llama-3) achieve the best performance under the cross-dataset variant, as their pretraining enables a broad understanding of general textual knowledge across domains, allowing them to effectively extract and generalize useful patterns from auxiliary datasets to the target dataset. In contrast, smaller models such as BERT are less suited for this setting, as they struggle to abstract transferable patterns even from datasets within the same domain, leading to limited performance gains.
5.3 Cross-dataset Fine-tuning: Single-domain vs. Cross-domain
Here, we study the effect of cross-dataset fine-tuning, covering both single-domain and cross-domain scenarios. The experiments are conducted with two foundation models: BERT and Llama-3. Each foundation model is first fine-tuned on a single dataset and then evaluated over 10 test datasets. We design an RRA metric (Equation 4) to evaluate the usefulness of each finetune set.
Based on Table 4, we make the following observations:
First, cross-dataset fine-tuning generally improves recommendation performance, but it may also introduce negative effects on the target dataset in some cases (as indicated by the red-highlighted results in the table). Notably, the Yelp and Hotel. datasets exhibit a higher likelihood of such degradation, possibly due to domain gaps and mismatches between the test sets and finetune sets. Moreover, for the Llama-3 model, Micro. and Movie. also demonstrate performance degradation under cross-dataset finetuning. Interestingly, these two datasets are where Llama-3 achieves the highest zero-shot performance among the 10 test sets. This suggests that Llama-3 likely encountered collaborative signals related to these domains during pretraining, allowing it to effectively capture user interests for video-based recommendations even without additional tuning.
Second, single-domain cross-dataset finetuning is not always more effective than cross-domain finetuning. Intuitively, user interests should be easier to model within the same domain, and the results in news (MIND–PENS) and books (Good.–Books) support this. However, the trend does not hold for movies, music, and fashion, whose same-domain results did not even rank in the top five. A possible reason is that MIND and PENS both originate from Microsoft, and Amazon is the source of Books as well as the parent company of Goodreads.com, suggesting that these dataset pairs may share more similar distributions.
Third, dataset quality varies, but its effectiveness also depends heavily on the capacity of the pretrained model. For instance, finetuning on Good. with BERT ranks only fifth, while using Llama-3 lifts it to first. This may be because Goodreads relies on book titles as content features, which are poorly represented in smaller models’ pretraining corpora. In contrast, Llama-3 better understands textual content, leading to more robust item representations and improved user modeling. According to the Top-5 RRA results, CDs and H&M offer the strongest transferability, while POG offers the weakest. Additionally, Good. and Last.fm show large performance gains when switching from BERT to Llama-3, suggesting complex content features paired with highly transferable user interests. On the other hand, MIND and Micro. show ranking drops, indicating that their simpler content may already be sufficiently modeled by smaller models, but their user behavior patterns are less suitable for cross-dataset transfer. Finally, although Books and Last.fm do not have corresponding test sets, their finetuned models still rank in the top five under Llama-3, suggesting strong generalization capability across domains.
5.4 Multi-domain Fine-tuning: Cross-dataset vs. Multi-dataset
We investigate the impact of multi-domain fine-tuning, focusing on two key settings: multi-domain cross-dataset and multi-domain multi-dataset. In the cross-dataset setting, foundation models are fine-tuned using datasets from domains different from the test sets, specifically: POG, PENS, Netflix, Books, and Last.fm. In contrast, the multi-dataset setting involves fine-tuning on datasets that share domains with the test sets but do not include overlapping data, namely: H&M, MIND, Micro., Good., and CDs. From the results in Table 5, we make the following observations:
First, the multi-domain multi-dataset setting achieves the best performance on H&M, MIND, Micro., Good., and CDs across all three foundation models, as it is fine-tuned directly on these datasets and thus captures domain-specific knowledge effectively. Second, although the multi-domain cross-dataset setting uses different datasets for fine-tuning, it consistently outperforms the zero-shot setting, highlighting the generalization benefits of multi-domain training with diverse user behavior patterns. Third, while the multi-dataset setting serves as an upper bound, the performance gap between it and the cross-dataset setting narrows with larger models. For instance, the improvement on H&M drops from 30.0% (BERT) to 16.8% (Llama-3), suggesting that large models fine-tuned on cross-domain data can better handle zero-resource scenarios. Finally, due to its reliance on specific data distributions, the multi-dataset setting underperforms on the five other datasets (Movie., Yelp, Steam, Elec., Hotel.), underscoring the fairness and robustness of the multi-domain cross-dataset setting.
6 Conclusion
We have introduced RecBench-MD, a novel and comprehensive benchmark designed to evaluate the recommendation capabilities of foundation models across a wide range of datasets and domains. Our thorough analysis of 19 foundation models across 15 datasets and 10 domains provides crucial insights into their performance in recommendation tasks. The findings demonstrate the substantial advantages of cross-dataset transfer learning and multi-domain training in improving the adaptability of foundation models. We expect that these insights, along with the valuable resources provided, will drive future advancements in the development of recommendation foundation models, offering a strong foundation for continued research and innovation in this field.
Appendix A Limitations
In this study, we assess the recommendation capabilities of foundation models on two of the most prevalent tasks: prompt-based approaches (similar to CTR models) and embedding-based approaches (akin to matching models), from a multi-dataset, multi-domain perspective. Nonetheless, our current evaluation does not encompass sequential recommendation, which represents a crucial area for future development and enhancement.
Appendix B Broader Impacts
Our benchmark offers a comprehensive and scalable framework for evaluating foundation models in zero-resource, multi-dataset, multi-domain recommendation scenarios, thereby promoting more systematic and reproducible research. It establishes a solid foundation for ongoing research and innovation in this field. Furthermore, the benchmark facilitates cross-domain fine-tuning, extending its benefits to other areas such as natural language processing.
Appendix C Technical Appendices
C.1 Impact of Fine-tuning Dataset Order In Multi-domain Recommendation
Previously, Table 5 compared three evaluation strategies: zero-shot, multi-domain cross-dataset, and multi-domain multi-dataset. Both multi-domain strategies involve training on a mixture of all available fine-tune sets. In contrast, we analyze a sequential fine-tuning strategy here, focusing on how to select datasets and determine the fine-tuning order. Given the combinatorial complexity of ordering all five datasets, we restrict our analysis to pairs of fine-tune sets. Based on the results in Table 6, we observe that:
First, in most cases, the value at row $i$, column $j$ exceeds that of the corresponding single-dataset fine-tuning result in row $i$, column $i$, indicating that using two datasets generally provides greater benefit than using only one. However, it does not necessarily surpass the value at row $j$, column $j$, since knowledge learned from dataset $i$ in the first step may be subject to catastrophic forgetting during continual fine-tuning with dataset $j$.
Second, building on this observation, the dataset used in the later stage (second step) of fine-tuning tends to have a dominant influence on the final performance. For example, in columns corresponding to datasets that achieved the best single-dataset results, green highlights are commonly observed (e.g., the PENS column when the test set is H&M, or the Books column when the test set is Good.). To further investigate this effect, we present a more detailed analysis in Figure 3, showing that datasets with stronger single-dataset performance are generally more effective when used in the second fine-tuning step.
Third, we further investigate which fine-tuning combinations are most likely to yield Top-5 performance. We hypothesize that this is related to the single-dataset performance of the fine-tune sets. To examine this, we present Figure 4. The results suggest that using a lower-ranked dataset in the first step, followed by the top-performing dataset in the second step, tends to produce the best outcomes for the target test set.
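The pairwise design of this analysis can be sketched as below: enumerating the ordered pairs of fine-tune sets, and ordering a chosen pair so that the set with the stronger single-dataset result is used in the second step, as the observations above suggest. The dataset list, function names, and AUC values are illustrative assumptions.

```python
from itertools import permutations

# Assumed list of five fine-tune sets, used only for illustration.
FINETUNE_SETS = ["POG", "PENS", "Netflix", "Books", "Last.fm"]

def ordered_pairs(datasets):
    """All (first step, second step) orderings of two distinct fine-tune sets."""
    return list(permutations(datasets, 2))

def order_pair(a, b, single_auc):
    """Place the set with the higher single-dataset AUC in the second step.

    single_auc: dict mapping fine-tune set -> single-dataset AUC on the target.
    """
    return (a, b) if single_auc[a] <= single_auc[b] else (b, a)
```

Five fine-tune sets yield 20 ordered pairs, which keeps the sequential study tractable compared with enumerating every ordering of every subset.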
C.2 Additional Evaluation Metrics
In the main text, we report only the AUC metric due to space constraints. Here, we provide additional evaluation metrics, including nDCG@1, nDCG@5, MRR, Recall@1, and Recall@5, for a more comprehensive comparison.
As shown in Table 7, Table 8, and Table 9, other metrics generally align with the AUC results, supporting the consistency of our findings. We will release the complete experimental results on our website.
References
- Bao et al. [2023a] Keqin Bao, Jizhi Zhang, Yang Zhang, Wenjie Wang, Fuli Feng, and Xiangnan He. Tallrec: An effective and efficient tuning framework to align large language model with recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, pages 1007–1014, 2023a.
- Bao et al. [2023b] Keqin Bao, Jizhi Zhang, Yang Zhang, Wang Wenjie, Fuli Feng, and Xiangnan He. Large language models for recommendation: Progresses and future directions. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 306–309, 2023b.
- Bi et al. [2024] Xiao Bi, Deli Chen, Guanting Chen, Shanhuang Chen, Damai Dai, Chengqi Deng, Honghui Ding, Kai Dong, Qiushi Du, Zhe Fu, et al. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024.
- Chen et al. [2022] Chaochao Chen, Huiwen Wu, Jiajie Su, Lingjuan Lyu, Xiaolin Zheng, and Li Wang. Differential private knowledge transfer for privacy-preserving cross-domain recommendation. In Proceedings of the ACM web conference 2022, pages 1455–1465, 2022.
- Chen et al. [2019] Chong Chen, Min Zhang, Chenyang Wang, Weizhi Ma, Minming Li, Yiqun Liu, and Shaoping Ma. An efficient adaptive transfer neural network for social-aware recommendation. In Proceedings of the 42nd international ACM SIGIR conference on research and development in information retrieval, pages 225–234, 2019.
- Chen et al. [2024] Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web, 27(4):42, 2024.
- Chen et al. [2013] Wei Chen, Wynne Hsu, and Mong Li Lee. Making recommendations from multiple domains. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 892–900, 2013.
- Dubey et al. [2024] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
