Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation

Shijia Wang; Tianpei Ouyang; Qiang Xiao; Dongjing Wang; Yintao Ren; Songpei Xu; Da Guo; Chuanjiang Luo

arXiv:2508.20359·cs.IR·August 29, 2025


TL;DR

This paper introduces a novel multimodal music recommendation framework that preserves semantic integrity across modalities and models cross-modal interests, leading to improved performance and practical deployment success.

Contribution

The paper proposes Progressive Semantic Residual Quantization and Multi-Codebook Cross-Attention to enhance multimodal interest modeling in music recommendation.

Findings

01

Outperforms state-of-the-art baselines on multiple datasets.

02

Achieves significant improvements in online A/B tests.

03

Successfully deployed in a major music streaming platform.


Tables (4)

Table 1. Statistics of datasets

Dataset	#Users	#Items	#Interactions
Amazon baby	81,423	33,652	230,444
Industrial	4,926,656	1,387,247	8,696,093
Music4all	14,127	99,596	2,597,382

Table 2. Performance of all baselines on various datasets, evaluated using AUC (for all items and cold-start items), with model fit assessed via Logloss. The top model is in bold, the second underlined. "%Improv." denotes percentage-based relative improvement over the best baseline. "All AUC" is calculated on all items, while "Cold AUC" is for cold-start items (fewer than 30 interactions).

Methods	Amazon Baby			Industrial			Music4all
Methods	All AUC	Cold AUC	Logloss	All AUC	Cold AUC	Logloss	All AUC	Cold AUC	Logloss
VBPR	0.6466	0.5377	2.6145	0.7407	0.7229	1.4869	0.6217	0.5174	5.7460
SimTier+MAKE	0.6213	0.5286	2.7974	0.7537	0.7446	1.2503	0.6871	0.6041	3.6962
DIN	0.6492	0.5487	2.5766	0.7599	0.7382	1.2188	0.7260	0.6699	2.9386
QARM	0.6557	0.5681	2.5686	0.7628	0.7429	1.2068	0.7347	0.7336	2.9181
PSRQ+MCCA	0.6573	0.5781	2.5564	0.7636	0.7535	1.2006	0.7347	0.7373	2.9070
%Improv.	+0.24%	+1.76%	-0.47%	+1.04%	+1.20%	-0.51%	+0.00%	+0.50%	-0.38%

Table 3. Performance when only textual semantic IDs are added to the DIN model, to fairly assess each quantization approach. The top performer is in bold, and the second is underlined.

Methods	Amazon Baby			Industrial			Music4all
Methods	All AUC	Cold AUC	Logloss	All AUC	Cold AUC	Logloss	All AUC	Cold AUC	Logloss
DIN	0.6492	0.5487	2.5766	0.7599	0.7382	1.2188	0.7260	0.6699	2.9386
+PQ	0.6534	0.5515	2.5754	0.7617	0.7387	1.2050	0.7313	0.7321	2.9234
+VQ	0.6520	0.5569	2.5725	0.7623	0.7411	1.2087	0.7328	0.7317	2.9146
+RQ	0.6531	0.5546	2.5661	0.7628	0.7334	1.2077	0.7329	0.7327	2.9204
+RQ-VAE	0.6535	0.5533	2.5711	0.7620	0.7400	1.2081	0.7338	0.7322	2.9184
+PSRQ	0.6540	0.5610	2.5705	0.7630	0.7442	1.2003	0.7345	0.7331	2.9183

Table 4. Ablation of the role of modality-specific and cross-modal shared queries within the MCCA framework.

Methods	Amazon Baby			Industrial			Music4all
Methods	All AUC	Cold AUC	Logloss	All AUC	Cold AUC	Logloss	All AUC	Cold AUC	Logloss
w/o MSC	0.6544	0.5549	2.5594	0.7627	0.7424	1.1991	0.7333	0.7281	2.9184
w/o MJC	0.6555	0.5619	2.5598	0.7631	0.7467	1.2003	0.7353	0.7364	2.9089
PSRQ+MCCA	0.6573	0.5781	2.5564	0.7636	0.7535	1.2006	0.7347	0.7373	2.9070

Equations

$$\begin{aligned}C_{1}&=\text{K-means}(X^{m}_{1},k), & X^{m}_{1}&=X^{m}\\ C_{2}&=\text{K-means}(X^{m}_{2},k), & X^{m}_{2}&=X^{m}_{1}-\text{NearestRep}(X^{m}_{1},C_{1})\\ &\;\;\vdots\\ C_{l}&=\text{K-means}(X^{m}_{l},k), & X^{m}_{l}&=X^{m}_{l-1}-\text{NearestRep}(X^{m}_{l-1},C_{l-1})\end{aligned}$$

$$\begin{split}X^{m}_{1}&=X^{m}\\ C_{1}&=\text{K-means}(X^{m}_{1},k)\\ X^{m}_{2}&=X^{m}-\text{NearestRep}(X^{m}_{1},C_{1})\\ C_{2}&=\text{K-means}\bigl(X^{m}_{2}\oplus(X^{m}-X^{m}_{2}),\;k\bigr)\\ &\;\;\vdots\\ X^{m}_{l}&=X^{m}_{l-1}-\text{NearestRep}(X^{m}_{l-1},C_{l-1})\\ C_{l}&=\text{K-means}\bigl(X^{m}_{l}\oplus(X^{m}-X^{m}_{l}),\;k\bigr)\end{split}$$

$$e^{z}_{i}=\sum_{j=1}^{l}\text{one-hot}(id^{z}_{ij})\times E^{z}_{j},\quad id^{z}_{ij}\in S^{z}_{i}$$

$$h^{z}_{u}=\sum_{j=1}^{n}\text{Cross-Attention}(Q=e^{o}_{t},K=e^{z}_{j},V=e^{z}_{j})=\sum_{j=1}^{n}\text{Attention}(e^{z}_{j}\oplus e^{o}_{t})\cdot e^{z}_{j}$$

$$h^{r}_{u}=\sum_{j=1}^{n}\text{Attention}(e^{r}_{j}\oplus e^{r}_{t})\cdot e^{r}_{j}$$

$$L=-\frac{1}{N}\sum_{j=1}^{N}\Bigl[y_{j}\log\sigma(\hat{y}_{j})+(1-y_{j})\log\bigl(1-\sigma(\hat{y}_{j})\bigr)\Bigr]$$


Full text

Progressive Semantic Residual Quantization for Multimodal-Joint Interest Modeling in Music Recommendation

Shijia Wang (NetEase Cloud Music, Hangzhou, China)

Tianpei Ouyang (NetEase Cloud Music; Hangzhou Dianzi University, Hangzhou, China)

Qiang Xiao (NetEase Cloud Music, Hangzhou, China)

Dongjing Wang (Hangzhou Dianzi University, Hangzhou, China)

Yintao Ren (NetEase Cloud Music, Hangzhou, China)

Songpei Xu (NetEase Cloud Music, Hangzhou, China)

Da Guo (NetEase Cloud Music, Hangzhou, China)

Chuanjiang Luo (NetEase Cloud Music, Hangzhou, China)

(2025)

Abstract.

In music recommendation systems, multimodal interest learning is pivotal, which allows the model to capture nuanced preferences, including textual elements such as lyrics and various musical attributes such as different instruments and melodies. Recently, methods that incorporate multimodal content features through semantic IDs have achieved promising results. However, existing methods suffer from two critical limitations: 1) intra-modal semantic degradation, where residual-based quantization processes gradually decouple discrete IDs from original content semantics, leading to semantic drift; and 2) inter-modal modeling gaps, where traditional fusion strategies either overlook modal-specific details or fail to capture cross-modal correlations, hindering comprehensive user interest modeling. To address these challenges, we propose a novel multimodal recommendation framework with two stages. In the first stage, our Progressive Semantic Residual Quantization (PSRQ) method generates modal-specific and modal-joint semantic IDs by explicitly preserving the prefix semantic feature. In the second stage, to model multimodal interest of users, a Multi-Codebook Cross-Attention (MCCA) network is designed to enable the model to simultaneously capture modal-specific interests and perceive cross-modal correlations. Extensive experiments on multiple real-world datasets demonstrate that our framework outperforms state-of-the-art baselines. This framework has been deployed on one of China’s largest music streaming platforms, and online A/B tests confirm significant improvements in commercial metrics, underscoring its practical value for industrial-scale recommendation systems.

Music Recommendation, Multimodal Representation, Residual Quantization, Semantic ID

Published in: Proceedings of the 34th ACM International Conference on Information and Knowledge Management (CIKM ’25), November 10–14, 2025, Seoul, Republic of Korea. Copyright: ACM licensed. DOI: 10.1145/3746252.3761579. ISBN: 979-8-4007-2040-6/2025/11. CCS concepts: Information systems, Recommender systems.

1. Introduction

On contemporary music streaming platforms, users exhibit varying preferences across different musical modalities, such as lyrics, instrumentation, and melody. The emphasis placed on each modality can also differ significantly between demographic groups. For instance, Fiore et al. (Fiore, 2016) found that adults focus more on lyrics, while children prioritize melody. Further research (Ma et al., 2021; Sangnark et al., 2021) has indicated that different musical modalities can have distinct impacts on users’ emotions. However, traditional recommendation models primarily rely on collaborative filtering (Linden et al., 2003), focus on modeling users’ behavioral preferences (Zhou et al., 2018), and lack the ability to learn multimodal interests.

In recent years, with the continuous advancement of multimodal feature extraction techniques, an increasing number of studies have applied multimodal information to fields such as short video and music recommendation (Wang et al., 2025; He and McAuley, 2016; Wei et al., 2019; Yu et al., 2022; Xu et al., 2025; Yang et al., 2024). These studies have shown that integrating multimodal information can significantly enhance the performance of recommendation systems by capturing the nuanced preferences of users. As researchers have delved deeper into multimodal recommendation, a central insight has emerged: the semantic representation spaces of different modalities, including visual, textual, and acoustic features, exhibit pronounced disparities (Girdhar et al., 2023; Li et al., 2021b; Lin et al., 2023). These differences can hinder the integration of cross-modal cues, which are crucial for understanding the subtleties of user preferences (Liu et al., 2024; Hu et al., 2023; Rafailidis et al., 2017). In addition, traditional content representations are inherently static: they are precomputed and fixed during training, so they cannot be optimized within an end-to-end recommendation framework. This limitation may impede the model’s adaptability to complex interaction patterns and could result in slower convergence during training (Sheng et al., 2024). How to bridge multimodal representations and recommendation systems to enable end-to-end training has therefore become a difficult problem.

Recently, quantization techniques have been widely applied in various fields and have achieved remarkable results (Babenko and Lempitsky, 2014; Li et al., 2021a, 2017). Among them, VQ-Rec (Hou et al., 2023) identified key challenges in vector quantization methods for recommender systems (VQ4Rec) and presented promising opportunities that can inspire future research in this emerging field. TIGER (Rajput et al., 2023) further brought the Residual-Quantized Variational AutoEncoder (RQ-VAE) (Lee et al., 2022) to recommendation, opening the door to transforming content features into semantic IDs through the use of codebooks (Martinez et al., 2014; Lee et al., 2022). Subsequently, numerous studies (Singh et al., 2024; Li et al., 2025) have demonstrated that semantic ID-based representations can bridge the aforementioned representation gap while endowing models with end-to-end adaptability for multimodal information. Furthermore, once the multimodal content of a cold-start item is mapped to a semantic ID, the item immediately inherits the learnable embedding of this ID from the codebook. This yields reliable and training-efficient representations, which significantly mitigate the data sparsity and cold-start problems (Luo et al., 2024; Chen et al., 2024).

Despite these advancements, two critical challenges remain:

• Intra-modal semantic degradation: Current approaches rely purely on geometric similarity (e.g., Euclidean distance or cosine similarity between residuals) for quantization. While Residual Quantization (RQ) (Lee et al., 2022) and RQ-VAE improve the accuracy of semantic IDs through iterative residual approximation, their layer-wise quantization process inherently decouples residual vectors from the original semantic meaning and overlooks hierarchical semantic alignment: the deeper the quantization layer, the weaker the connection to the original semantics. As shown in Figure 1 (c), the residual vectors can lead to more diverse and discrete clustering results, but tend to overlook the associations with the original semantics. Consequently, the generated cluster IDs may deviate from the intended item semantics, leading to suboptimal recommendation performance.

• Inter-modal modeling gaps: Existing paradigms such as QARM (Luo et al., 2024) and OneRec (Deng et al., 2025) first fuse multimodal features via contrastive learning before quantization, which inevitably suppresses modal-unique signals critical for fine-grained user preference modeling. While M3CRS (Chen et al., 2024) preserves modal-specific characteristics through an independent embedding table, its isolated modeling of users’ modal-specific interests fails to capture cross-modal synergies (e.g., audio-lyrics complementarity in music). In recommendation systems, however, both aspects are of paramount importance (Sheng et al., 2024; Liu et al., 2024; Chen et al., 2024, 2022; Zhou et al., 2025). Therefore, the second challenge is how to simultaneously capture fine-grained modal preferences and exploit complementary cross-modal correlations for multimodal interest modeling based on semantic IDs.

To address these challenges, we propose a multimodal quantization-based recommendation framework that enhances both semantic fidelity and cross-modal interaction. In the feature engineering stage, a novel Progressive Semantic Residual Quantization (PSRQ) method preprocesses multimodal embeddings by explicitly preserving the prefix semantic feature, generating modal-specific and modal-joint semantic IDs that maintain strong alignment with the original semantics. Then, to model users’ multimodal interests, we introduce the Multi-Codebook Cross-Attention (MCCA) network, which employs a shared modal-joint codebook as a cross-modal query to model multimodal embedding sequences. This approach operates end-to-end in the ranking stage (Wang et al., 2024b) of the recommendation system, jointly optimizing for semantic consistency and adaptive multimodal fusion to achieve superior recommendation performance.

In summary, the contributions of our study are as follows:

• We propose a novel Progressive Semantic Residual Quantization method that constrains residual quantization with prefix semantics, enhancing semantic preservation.

• We propose a Multi-Codebook Cross-Attention Network for multimodal interest learning, simultaneously capturing modality specificity and cross-modal associations.

• Extensive offline experiments on three real-world datasets and online A/B tests verify the effectiveness of the proposed method, which significantly improves cold-start performance metrics.

2. Related Work

2.1. Multimodal Representation for Recommendation

In recent years, multimodal content features have achieved exciting results in enhancing recommendation systems. Early representative work, such as VBPR (He and McAuley, 2016), introduced visual features into the recommendation field using matrix factorization. MMGCN (Wei et al., 2019) modeled fine-grained, modality-specific user preferences on a user-item bipartite graph built for each modality. To further improve multimodal recommendation, the fusion of multimodal interest representations is crucial. A common approach is to fuse the multimodal embeddings of items through pre-training tasks. AlignRec (Liu et al., 2024) pre-trained a visual-text alignment task using a mask-then-predict strategy. Sheng et al. (Sheng et al., 2024) used SimTier to refine users’ multimodal interest representations built on representations pre-trained with contrastive learning, and used MAKE to address the difference in the number of training epochs required by ID features versus multimodal representations.

2.2. Quantized Representation Learning for Recommendation

Quantized representation learning has recently attracted extensive attention due to its ability to extract semantic information, and its effectiveness has been demonstrated in multiple fields. VQ-Rec (Hou et al., 2023) converts text content vectors into sparse semantic ID representations through product quantization (PQ). TIGER (Rajput et al., 2023) further utilizes RQ-VAE to generate hierarchical semantic IDs from textual content features as item representations. Although RQ and RQ-VAE approximate the original embeddings through residuals to improve the accuracy of the quantized representations, they still face the hourglass problem (Kuai et al., 2024), resulting in an uneven distribution in the discrete space and limited deep-level semantic associations. OneRec (Deng et al., 2025) enforces a uniform distribution of the number of elements in each layer of RQ through a multi-level balanced quantization mechanism. Singh et al. (Singh et al., 2024) showed that semantic IDs with SentencePiece models (SPM) (Kudo, 2018) are a more adaptive and efficient way to represent item content and achieve better generalization. Furthermore, Zheng et al. (Zheng et al., 2025) propose a prefix n-gram parameterization method and show that incorporating the hierarchical nature of clustering into the embedding table mapping is effective.

3. Preliminary

Problem Definition. Let $\mathcal{U}$ and $\mathcal{I}$ denote the sets of users and items, respectively, and let $|\mathcal{U}|$ and $|\mathcal{I}|$ denote the numbers of users and items. For all items, we obtain their multimodal content embeddings $X^{m}\in\mathbb{R}^{|\mathcal{I}|\times d}$ using existing content feature extraction methods. Here, $m\in\{v,t,a\}$, where $v$ denotes visual, $t$ denotes textual, and $a$ denotes audio. The specific content feature extraction methods are detailed in Section 5.1.1. For each user $u\in\mathcal{U}$, we construct a historical behavior sequence $\mathcal{H}_{u}=\{i_{1}^{h},i_{2}^{h},\ldots,i_{n}^{h}\}$ from positive interactions such as click, comment, or collect. In this sequence, $i_{n}^{h}\in\mathcal{I}$ denotes the item associated with the $n$-th interaction. Our recommendation task is to predict the probability $\hat{y}_{u,t}$ that the user $u$ will positively interact with the target item $i_{t}\in\mathcal{I}$.

Residual Quantization. In the conventional K-means-based Residual Quantization (RQ) process, each layer takes the residual vector from the previous layer as input and applies the K-means algorithm to obtain the cluster centers, which form the codebook for the layer. For each kind of multimodal embeddings $X^{m}_{i}$ :

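$$\begin{aligned}C_{1}&=\text{K-means}(X^{m}_{1},k), & X^{m}_{1}&=X^{m}\\ C_{2}&=\text{K-means}(X^{m}_{2},k), & X^{m}_{2}&=X^{m}_{1}-\text{NearestRep}(X^{m}_{1},C_{1})\\ &\;\;\vdots\\ C_{l}&=\text{K-means}(X^{m}_{l},k), & X^{m}_{l}&=X^{m}_{l-1}-\text{NearestRep}(X^{m}_{l-1},C_{l-1})\end{aligned}$$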

where $l$ is the number of quantization layers, $C_{l}\in\mathbb{R}^{k\times d}$ are the cluster center embeddings generated at layer $l$, $k$ is the number of K-means cluster centers, and $\text{NearestRep}(\cdot)$ denotes the nearest-representation search among the cluster center embeddings. The semantic ID retrieval process of RQ is shown in Figure 2 (a).
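To make the retrieval step concrete, the following is a minimal pure-Python sketch of how semantic IDs could be read off once the per-layer codebooks exist. The function names `nearest` and `rq_encode` and the toy codebooks are illustrative assumptions, not taken from the paper; squared Euclidean distance is assumed for the nearest-representation search.

```python
def nearest(vec, centers):
    """Index of the center closest to vec (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centers)), key=lambda c: sq_dist(vec, centers[c]))


def rq_encode(x, codebooks):
    """Conventional RQ lookup: one cluster ID per layer.

    Each layer quantizes the residual left over by the previous layer,
    so only the first layer ever sees the original vector.
    """
    residual = list(x)
    ids = []
    for centers in codebooks:
        j = nearest(residual, centers)
        ids.append(j)
        # Subtract the chosen center; the remainder goes to the next layer.
        residual = [r - c for r, c in zip(residual, centers[j])]
    return ids, residual


# Toy example with two layers of two centers each (hypothetical values).
codebooks = [
    [[0.0, 0.0], [10.0, 10.0]],  # layer 1
    [[0.0, 1.0], [1.0, 0.0]],    # layer 2
]
ids, residual = rq_encode([11.0, 10.0], codebooks)
# ids == [1, 1]; residual == [0.0, 0.0]
```

Note that the second layer only sees `[1.0, 0.0]`, which carries no trace of where the item sat in the original space. This is the decoupling that the paper's method is designed to counter.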

4. Methodology

In this section, we elaborate on the components of our framework and its overall deployment workflow, depicted in Figure 3, which consists of two stages, feature engineering and downstream recommendation model training.

4.1. Progressive Semantic Residual Quantization

In the feature engineering phase, inspired by previous work (Chen et al., 2024; Luo et al., 2024), we do not directly utilize the original static multimodal content embeddings $X^{m}\in\mathbb{R}^{|\mathcal{I}|\times d}$. Instead, we employ our proposed Progressive Semantic Residual Quantization (PSRQ) method to map these embeddings to semantic ID representations. Unlike RQ, PSRQ introduces a critical modification: the residual vector is subtracted from the original content feature vector, and the result is concatenated with the residual to better retain the original semantic information, as detailed in the following equations:

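$$\begin{split}X^{m}_{1}&=X^{m}\\ C_{1}&=\text{K-means}(X^{m}_{1},k)\\ X^{m}_{2}&=X^{m}-\text{NearestRep}(X^{m}_{1},C_{1})\\ C_{2}&=\text{K-means}\bigl(X^{m}_{2}\oplus(X^{m}-X^{m}_{2}),\;k\bigr)\\ &\;\;\vdots\\ X^{m}_{l}&=X^{m}_{l-1}-\text{NearestRep}(X^{m}_{l-1},C_{l-1})\\ C_{l}&=\text{K-means}\bigl(X^{m}_{l}\oplus(X^{m}-X^{m}_{l}),\;k\bigr)\end{split}$$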

where $\oplus$ denotes the concatenation operation, $C_{1}\in\mathbb{R}^{k\times d}$, $C_{2}$ to $C_{l}$ are all in $\mathbb{R}^{k\times 2d}$, and $m\in\{v,t,a\}$. In our online system, we use only the textual and audio modal embeddings, i.e., $m\in\{t,a\}$. Specifically, for the modal-joint information, we concatenate the embeddings $X^{m}$ into modal-joint embeddings $X^{o}\in\mathbb{R}^{|\mathcal{I}|\times 2d}$, perform PSRQ on them, and generate modal-joint semantic IDs $S^{o}_{i}$ for each item, where $o$ denotes multimodal joint information.

Then, for each item $i$, we retrieve the nearest cluster ID $id^{m}_{j}\in\{0,1,\dots,k-1\}$ from each quantization layer $j$ to form the semantic IDs $S^{m}_{i}=[id^{m}_{1},id^{m}_{2},\dots,id^{m}_{l}]$. The difference between the semantic ID retrieval processes of conventional RQ and PSRQ is shown in Figure 2.
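The PSRQ recursion described above can be sketched in pure Python as follows. This is an illustrative reading rather than the authors' implementation: the helper names, the plain Lloyd's K-means, and its iteration count and seed are all assumptions. One point is not spelled out in the text: from layer 2 onward the codebook has width $2d$ while the residual has width $d$. The sketch assumes the nearest center is found in the concatenated $2d$ space and that only its first $d$ coordinates (the residual half) are subtracted.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, centers):
    return min(range(len(centers)), key=lambda c: sq_dist(vec, centers[c]))


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means (assumed here; any K-means would do)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centers)].append(p)
        for c, members in enumerate(buckets):
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers


def psrq(X, k, num_layers):
    """Return (codebooks, semantic_ids) for a list of d-dim vectors X."""
    d = len(X[0])
    residual = [list(x) for x in X]  # X_1 = X
    codebooks, ids = [], [[] for _ in X]
    for layer in range(num_layers):
        if layer == 0:
            feats = [list(r) for r in residual]
        else:
            # X_l (+) (X - X_l): residual concatenated with the prefix
            # semantics, i.e. what the earlier layers already explain.
            feats = [r + [a - b for a, b in zip(x, r)]
                     for x, r in zip(X, residual)]
        centers = kmeans(feats, k)
        codebooks.append(centers)
        for i, f in enumerate(feats):
            j = nearest(f, centers)
            ids[i].append(j)
            # ASSUMPTION: subtract only the residual half of the center.
            residual[i] = [a - b for a, b in zip(residual[i], centers[j][:d])]
    return codebooks, ids
```

Under this reading, the modal-joint IDs would come from calling the same routine on the row-wise concatenation of the textual and audio embeddings.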

4.2. Multi-Codebook Cross-Attention

Following the quantization of modal-specific and joint content representations, we integrate semantic IDs into collaborative filtering-based recommendation models. This integration enables the modeling of user multimodal interests, enhancing the generalization capability.

4.2.1. Hierarchical Embedding Layer

To enable end-to-end optimization of content features beyond the constraints of static representations, we do not use the original cluster centroid embeddings as semantic ID embeddings. Instead, within the embedding layer of our model, we use randomly initialized embedding tables for both the modal-specific and modal-joint semantic IDs generated through the quantization codebooks. Specifically, for the modal-specific semantic IDs $S^{m}_{i}$ and modal-joint semantic IDs $S^{o}_{i}$ of each item, we allocate a randomly initialized embedding table $E^{z}\in\mathbb{R}^{k\times d^{\prime}}$ for each quantization layer, where $d^{\prime}$ is the embedding size and $z\in\{t,a,o\}$, with $t$ denoting textual, $a$ denoting audio, and $o$ denoting modal-joint information. We then retrieve the semantic ID embeddings of each layer and aggregate them into the final semantic representation ${e^{z}_{i}}\in\mathbb{R}^{d^{\prime}}$ of each item in the user history sequence.

$$e^{z}_{i}=\sum_{j=1}^{l}\text{one-hot}(id^{z}_{ij})\times E^{z}_{j},\quad id^{z}_{ij}\in S^{z}_{i}$$

where $\text{one-hot}(\cdot)$ encodes $id^{z}_{ij}$ as a one-hot vector that selects the corresponding row of the embedding table. In this way, we form the modal-specific and modal-joint semantic embedding sequences $\{e_{1}^{z},e_{2}^{z},\ldots,e_{n}^{z}\}$.

For the target item $i_{t}$ , we only use the modal-joint codebook $E^{o}$ to obtain its modal-joint semantic embedding ${e^{o}_{t}}\in\mathbb{R}^{d^{\prime}}$ , which aims to capture cross-modal associations.

Furthermore, we integrate collaborative signals from the recommendation system by leveraging the sequence of ID embeddings $\{e_{1}^{r},e_{2}^{r},\ldots,e_{n}^{r}\}$ and the target item ID embedding $e_{t}^{r}\in\mathbb{R}^{d^{\prime}}$ . This approach ensures that the model effectively captures the collaborative patterns within the data.
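The layer-wise lookup and sum used for the semantic ID embeddings can be sketched as below. Multiplying a one-hot vector by a table is the same as picking one row, so the sketch indexes rows directly. The names `init_tables` and `embed_semantic_ids`, the Gaussian initialization, and its scale are assumptions for illustration; in the actual model the tables would be trainable parameters.

```python
import random


def init_tables(num_layers, k, dim, seed=0):
    """One randomly initialized k x dim table per quantization layer."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 0.01) for _ in range(dim)] for _ in range(k)]
            for _ in range(num_layers)]


def embed_semantic_ids(semantic_ids, tables):
    """Sum the per-layer embedding rows selected by an item's semantic IDs."""
    dim = len(tables[0][0])
    out = [0.0] * dim
    for layer, code in enumerate(semantic_ids):
        row = tables[layer][code]  # one-hot(id) x E_j reduces to a row lookup
        out = [o + r for o, r in zip(out, row)]
    return out


# Hand-written tables so the arithmetic is easy to follow.
tables = [
    [[1.0, 0.0], [0.0, 1.0]],      # layer 1
    [[10.0, 10.0], [20.0, 20.0]],  # layer 2
]
e = embed_semantic_ids([1, 0], tables)
# e == [10.0, 11.0]
```

Because two items that share a prefix of IDs also share the corresponding rows, a cold-start item starts from embeddings already trained on similar content.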

4.2.2. Cross-Attention Layer

After obtaining the user modal-specific and joint embedding sequences, we use the cross-attention mechanism to model the user modal-specific interests while capturing cross-modal correlations. The modal-joint semantic IDs embedding of the target item ${e^{o}_{t}}$ is utilized as a shared query to calculate the attention scores over the modal-specific and joint embedding sequences of the user history behaviors. This design is capable of alleviating the issue of inconsistent representation spaces across modalities to a certain extent, while also supporting the capture of cross-modal correlations. The multimodal interest representations ${h^{z}_{u}}\in\mathbb{R}^{d^{\prime}}$ are then computed using attention-weighted aggregation:

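$$h^{z}_{u}=\sum_{j=1}^{n}\text{Cross-Attention}(Q=e^{o}_{t},K=e^{z}_{j},V=e^{z}_{j})=\sum_{j=1}^{n}\text{Attention}(e^{z}_{j}\oplus e^{o}_{t})\cdot e^{z}_{j}$$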

where $\text{Attention}(\cdot)$ is a feed-forward network with output as the activation weight.

To capture collaborative patterns, we also apply the attention mechanism to the collaborative embedding sequence $\mathcal{H}_{u}^{r}$ and then obtain the collaborative interest representation of the user:

$$h^{r}_{u}=\sum_{j=1}^{n}\text{Attention}(e^{r}_{j}\oplus e^{r}_{t})\cdot e^{r}_{j}$$

where $h_{u}^{r}$ is collaborative interest representation of the user.
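A minimal sketch of the attention-weighted aggregation follows, assuming a one-hidden-layer feed-forward scorer with ReLU and no softmax normalization over the sequence. The text only states that the scorer is a feed-forward network whose output is the activation weight, so the layer sizes, the activation, and the function names here are assumptions.

```python
def attention_weight(e_hist, e_query, W1, b1, w2, b2):
    """Scalar weight from a small feed-forward net on [e_hist ; e_query]."""
    x = e_hist + e_query  # list concatenation plays the role of (+)
    hidden = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2


def interest(history, e_query, params):
    """Weighted sum of history embeddings, one weight per history item."""
    h = [0.0] * len(history[0])
    for e_j in history:
        a = attention_weight(e_j, e_query, *params)
        h = [acc + a * v for acc, v in zip(h, e_j)]
    return h


# One-dimensional toy case with hand-picked parameters.
params = ([[1.0, 1.0]], [0.0], [1.0], 0.0)  # W1, b1, w2, b2
h = interest([[2.0], [3.0]], [1.0], params)
# weights are 3.0 and 4.0, so h == [18.0]
```

For the multimodal branches, the same target query (the modal-joint embedding of the target item) would be passed with the textual, audio, and modal-joint history sequences in turn. The collaborative branch uses the target item ID embedding as its query. Whether the scorer parameters are shared across branches is not stated in the text.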

4.3. Model Prediction & Optimization

To derive the probability of positive interaction between the user and the target item, we concatenate the multimodal interest representations ${h^{z}_{u}}$ and the collaborative interest representation $h_{u}^{r}$ with the target item’s collaborative embedding ${e_{t}^{r}}$ and modal-joint semantic ID embedding ${e^{o}_{t}}$. This concatenated vector is then fed into a multi-layer perceptron (MLP) to produce the predicted logit $\hat{y}_{j}$. Since positive interaction prediction is a binary classification task, we employ the cross-entropy loss as the objective function for model training and optimization:

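$$L=-\frac{1}{N}\sum_{j=1}^{N}\Bigl[y_{j}\log\sigma(\hat{y}_{j})+(1-y_{j})\log\bigl(1-\sigma(\hat{y}_{j})\bigr)\Bigr]$$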

where $N$ is the total number of training instances and $y_{j}\in\{0,1\}$ is the label for each sample.
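As a small worked example of this objective, the sketch below computes binary cross-entropy directly from logits. It uses the algebraically equivalent form $\max(z,0)-zy+\log(1+e^{-|z|})$ per sample, a standard rewriting that avoids overflow for large logits; the function name is an illustrative choice.

```python
import math


def bce_with_logits(logits, labels):
    """Mean binary cross-entropy, with the sigmoid folded into the loss.

    Per sample: -[y*log(s(z)) + (1-y)*log(1-s(z))]
              = max(z, 0) - z*y + log(1 + exp(-|z|))
    """
    total = 0.0
    for z, y in zip(logits, labels):
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


loss = bce_with_logits([0.0], [1])
# a logit of 0 means probability 0.5, so loss == log(2)
```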

5. Experiments

In this section, we conduct a variety of offline experiments and online tests to evaluate the proposed method. Specifically, our aim is to address the following research questions.

• RQ1: How does the proposed quantization-based recommendation framework compare in performance to general and SOTA multimodal recommendation methods?

• RQ2: How do the semantic IDs generated by the PSRQ method perform in recommendation tasks compared to those of other quantization methods?

• RQ3: Are the modal-specific and modal-joint codebooks in MCCA effective?

• RQ4: How does the proposed method (PSRQ+MCCA) perform overall in real-world online scenarios?

5.1. Experimental Settings

5.1.1. Datasets

To evaluate the performance of the proposed method, we conduct experiments on an industrial dataset and two public datasets, Amazon Baby (Hou et al., 2024) and Music4all (Santana et al., 2020). Detailed data statistics and multimodal information of each dataset are presented in Table 1. For all datasets, we mark items with fewer than 30 interactions as cold-start items.

• Amazon Baby: For the baby benchmark in the Amazon review dataset, we define interactions with ratings of 4 or higher as positive samples and those with ratings below 4 as negative samples. User interaction sequences are constructed based on timestamps. The image features of the items are extracted using the preexisting CNN-based method provided by the dataset. Textual features are obtained by the LLaMA3.2-1B (Malinovskii et al., 2024) model from the text composed of the product title, description, brand, and categorical information.

• Industrial: This dataset is collected from our online music platform over a one-week period. To construct the training samples, we treat users’ song collection interactions as positive samples and songs that users played but did not collect as negative samples; user historical interaction sequences are further built based on these positive interactions. For textual feature extraction, we leverage the Baichuan2-7B model (Yang et al., 2023) to generate textual embeddings from multi-source textual information, including song titles, genre tags, and lyrics. For audio feature extraction, we use the MERT-v1-95M model (Li et al., 2023) to extract audio embeddings directly from MP3-format audio files.

• Music4all: We define repeatedly played songs as positive samples and songs played only once as negative samples. Text features are extracted by LLaMA3.2 from song titles, lyrics, genre tags, etc. The audio features are also sampled and extracted by the MERT model.

5.1.2. Implementation Details

We implement all methods in TensorFlow 2. The number of epochs is set to one, the training batch sizes on the three datasets are {64, 512, 512}, and the learning rates are {0.0005, 0.0001, 0.0001}. The dimension $d^{\prime}$ of both the multimodal semantic ID embeddings and the ID embeddings is set to 64. The user history sequence is truncated to a maximum length of 20. For all models, we adopt the Adam (Kingma and Ba, 2014) optimizer. Multimodal Large Language Models (MLLMs) (Wang et al., 2024a), including Baichuan2-7B, MERT-v1-95M, and LLaMA3.2-1B, are all deployed on NVIDIA A100 GPUs. For the proposed PSRQ and the other quantization methods, we set the number of clusters $k$ on Amazon Baby, Industrial, and Music4all to {64, 256, 128}, and the number of layers $l$ of RQ, PQ, RQ-VAE, and PSRQ to {3, 4, 3, 3}, respectively. Furthermore, to ensure a fair comparison, we keep parameter scales, learning rates, and batch sizes consistent across all models on each dataset.

5.1.3. Evaluation Metrics

To comprehensively evaluate the performance of the models, we adopt AUC (Area Under the Curve) (Hanley and McNeil, 1982) as the primary evaluation metric. We define "All AUC" as the AUC evaluated on all items, while "Cold AUC" refers to the AUC on cold-start items. In addition, under the same model parameter magnitude, we report the Logloss metric, whose formula is given in Section 4.3.
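The two AUC variants can be illustrated with a short sketch. AUC is computed here from its pairwise definition (the fraction of positive-negative pairs ranked correctly, with ties counted as half), which is quadratic in the number of samples and meant only for illustration. The function names are hypothetical, and the sketch assumes per-item interaction counts are supplied by the caller, since the text does not say over which split they are counted.

```python
def auc(scores, labels):
    """Pairwise AUC: P(score of a positive > score of a negative)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cold_auc(scores, labels, item_ids, interaction_counts, threshold=30):
    """AUC restricted to samples whose item has fewer than `threshold`
    interactions (30 is the cold-start cutoff stated in the text)."""
    keep = [i for i, item in enumerate(item_ids)
            if interaction_counts[item] < threshold]
    return auc([scores[i] for i in keep], [labels[i] for i in keep])


scores = [0.9, 0.2, 0.6, 0.4]
labels = [1, 0, 0, 1]
items = ["a", "b", "c", "d"]
counts = {"a": 5, "b": 100, "c": 10, "d": 3}
all_auc = auc(scores, labels)                      # 0.75
cold = cold_auc(scores, labels, items, counts)     # 0.5
```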

5.2. Offline Performance Comparison

5.2.1. Comparison with the Recommendation Models Proposed for Industry Scenarios (RQ1)

To validate the effectiveness of the proposed framework, we compare the performance of our model with various baselines, including the latest SOTA models.

• DIN (Zhou et al., 2018): This model utilizes the attention mechanism to dynamically capture users’ interests from their historical behaviors.

• VBPR (He and McAuley, 2016): This model integrates the multimodal embeddings and ID embeddings of each item as its representation and uses the matrix factorization (MF) framework to reconstruct the historical interactions between users and items.

• SimTier+MAKE (Sheng et al., 2024): An industrial-grade recommendation framework that combines similarity tiering and multimodal knowledge embedding to handle large-scale heterogeneous data.

• QARM (Luo et al., 2024): A recent SOTA approach that integrates contrastive learning and quantization techniques to enhance recommendation efficiency while preserving model expressiveness.

As shown in Table 2, the proposed PSRQ+MCCA model achieves superior performance across most experimental results on the three datasets, particularly excelling in the recommendation of cold-start items. QARM, which integrates contrastive learning and quantization techniques to enhance semantic generalization, attains the second-best performance across multiple datasets. DIN fits the data well thanks to the thorough end-to-end training of ID embeddings, despite lacking multimodal information. In contrast, SimTier+MAKE outperforms only VBPR in All AUC, and only on the Industrial and Music4all datasets, likely due to insufficient training within a single epoch.

5.2.2. Comparison of Quantization Method (RQ2)

To rigorously assess how the proposed PSRQ method improves on alternative quantization techniques in terms of content feature generalization and recommendation performance, we use the DIN (Zhou et al., 2018) model as the backbone for every method, keep batch sizes and learning rates identical, and use only the textual modality embedding. This setup mitigates the influence of extraneous factors, so that the evaluation focuses on the quantization methods themselves. The quantization methods compared are as follows:

• PQ (Jégou et al., 2011): A widely used technique that compresses high-dimensional vectors by dividing each vector into subvectors and quantizing each subvector independently.

• VQ (van den Oord et al., 2017): This method compresses high-dimensional vectors using a single codebook generated by K-means clustering.

• RQ (Ferdowsi et al., 2017): RQ is also based on K-means clustering, but it iteratively quantizes the residuals left by the previous level to approximate the original vectors more accurately.

• RQ-VAE (Lee et al., 2022): This approach extends RQ by integrating it with an autoencoder architecture, which helps in reconstructing rich semantic information.
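To make the difference between single-layer VQ and multi-layer RQ concrete, the following minimal sketch fits K-means codebooks on successive residuals and encodes a vector as a list of code indices. It illustrates plain RQ only, not PSRQ or RQ-VAE. The function names, the toy K-means, and all hyperparameters are assumptions chosen for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vec, codebook):
    """Index of the codeword closest to vec."""
    return min(range(len(codebook)), key=lambda k: sq_dist(vec, codebook[k]))


def kmeans(vectors, k, iters=20, seed=0):
    """Toy Lloyd's K-means; an empty cluster keeps its previous centroid."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in vectors:
            groups[nearest(v, centroids)].append(v)
        for c, members in enumerate(groups):
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def fit_rq(vectors, levels, k):
    """Fit one codebook per level, each on the residuals of the previous level."""
    codebooks = []
    residuals = [list(v) for v in vectors]
    for level in range(levels):
        codebook = kmeans(residuals, k, seed=level)
        codebooks.append(codebook)
        residuals = [
            [x - c for x, c in zip(r, codebook[nearest(r, codebook)])]
            for r in residuals
        ]
    return codebooks


def encode(vec, codebooks):
    """Map a vector to one code index per level (its semantic ID)."""
    ids, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        ids.append(idx)
        residual = [x - c for x, c in zip(residual, codebook[idx])]
    return ids


def decode(ids, codebooks):
    """Reconstruct a vector by summing the selected codeword of every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(ids, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

With `levels=1` the sketch reduces to K-means VQ. Each additional level quantizes what the previous levels failed to reconstruct, so the total reconstruction error on the training vectors does not increase. PQ would instead split each vector into subvectors and run `kmeans` on each slice independently.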

Referring to Table 3, all quantization methods improve the performance of the DIN model. Among them, PQ, which lacks global semantic information, performs worst. VQ, despite employing only single-layer quantization, achieves considerable improvements on cold-start items. Both RQ and RQ-VAE achieve the second-best results on multiple datasets. The PSRQ method delivers the best overall performance, on both all items and cold-start items, across the three datasets.

5.2.3. Ablation Study of MCCA (RQ3)

To validate the effectiveness of MCCA, we conduct ablation studies that isolate the impact of the modal-specific and modal-joint codebooks:

• w/o modal-specific codebooks (w/o MSC): We use only the modal-joint codebook during attention modeling, eliminating the dedicated modal-specific interest representations. This ablation tests whether removing granular modal semantics weakens the model’s ability to capture fine-grained user preferences across different content modalities.

• w/o modal-joint codebooks (w/o MJC): To assess the significance of cross-modal correlation modeling, we exclude the shared modal-joint codebook for the user’s sequence and the shared query. Instead, the modal-specific semantic ID embeddings serve as queries for the respective modality’s semantic embedding sequences. This variant investigates whether the absence of a shared query mechanism hinders the capacity to exploit inter-modal dependencies, thereby affecting recommendation performance.
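The difference between the two ablations can be sketched as a choice of which query attends over which behavior sequence. The code below is a simplified illustration and not the MCCA implementation: it uses plain scaled dot-product attention, the `"full"` branch is an assumption inferred from the ablation descriptions, and the names `attend`, `user_interest`, and the modality keys `"text"` and `"audio"` are hypothetical.

```python
import math


def attend(query, keys):
    """Scaled dot-product attention of one query over a sequence.

    The keys also serve as values; returns the attention-weighted average.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    norm = sum(weights)
    return [
        sum(w * key[d] for w, key in zip(weights, keys)) / norm
        for d in range(len(query))
    ]


def user_interest(target, history, variant="full"):
    """Concatenate interest vectors for one target item.

    target:  dict mapping codebook name -> semantic-ID embedding of the target
    history: dict mapping codebook name -> list of embeddings of past items
    Both dicts hold a "joint" entry plus one entry per modality.
    """
    modalities = [name for name in history if name != "joint"]
    parts = []
    if variant == "full":
        # Assumed full model: the shared modal-joint query attends over the
        # joint sequence and over every modal-specific sequence.
        for name in ["joint"] + modalities:
            parts += attend(target["joint"], history[name])
    elif variant == "wo_msc":
        # w/o MSC: only the modal-joint codebook is used.
        parts += attend(target["joint"], history["joint"])
    elif variant == "wo_mjc":
        # w/o MJC: each modality queries its own sequence; no shared query.
        for name in modalities:
            parts += attend(target[name], history[name])
    else:
        raise ValueError(f"unknown variant: {variant}")
    return parts
```

In this sketch, the w/o MJC variant never lets one modality's query see another modality's sequence, which is the loss of cross-modal interaction that the ablation is designed to measure.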

The results of the ablation study, shown in Table 4, indicate that the full MCCA framework outperforms both variants on most datasets and metrics, validating the effectiveness of extracting cross-modal information through modal-joint codebooks while modeling each modality independently. Although on the Music4all dataset MCCA’s performance on all items is slightly inferior to that of w/o MJC, the improvements on cold-start items demonstrate that cross-modal associations benefit the model’s generalization, particularly for cold-start items.

5.3. Online A/B Tests (RQ4)

We conducted an A/B test of our online ranking model on our music streaming platform in February 2025; the platform delivers song recommendations to tens of millions of users daily. The baseline model was the industry-standard Deep Learning Recommendation Model (DLRM) (Naumov et al., 2019). The experiment group enhanced the baseline by incorporating user multimodal interest representations derived from PSRQ-generated semantic IDs and MCCA. Our core metrics measure user engagement with music tracks through behaviors such as $collect$ and $full\_play$. The $collect$ behaviors represent users’ actions of adding songs to their favorite playlists, while the $full\_play$ behaviors signify that users have played the songs completely. During the trial period, the experiment group saw a 2.81% increase in $collect$ and a 0.95% increase in $full\_play$ compared to the control group. For new tracks released within the last 30 days, the probabilities of $collect$ and $full\_play$ increased by 5.98% and 2.2%, respectively. In addition, the listening hours of the new tracks increased by 3.05%.
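For readers unfamiliar with how such percentages are typically derived, the sketch below computes a relative lift from per-group event counts. It assumes the reported increases are relative lifts of the experiment group over the control group, which the text does not state explicitly. The function names and all counts are hypothetical and are not taken from the experiment.

```python
def rate(events, impressions):
    """Fraction of impressions that led to the event (e.g. collect)."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return events / impressions


def relative_lift(control_rate, treatment_rate):
    """Relative change of the treatment rate over the control rate, in percent."""
    if control_rate <= 0:
        raise ValueError("control rate must be positive")
    return (treatment_rate - control_rate) / control_rate * 100.0


# Hypothetical counts, for illustration only.
control = rate(events=20_000, impressions=1_000_000)
treatment = rate(events=20_500, impressions=1_000_000)
lift_percent = relative_lift(control, treatment)  # roughly 2.5 (percent)
```

Under this reading, a rate that moves from 2.00% to 2.05% is a 2.5% relative lift, even though the absolute change is only 0.05 percentage points.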

6. Conclusion

In this work, we introduced a novel multimodal recommendation framework to address the persistent challenges of semantic degradation and cross-modal modeling gaps in music recommendation systems. Our Progressive Semantic Residual Quantization (PSRQ) method effectively preserves the original semantics during quantization, while the Multi-Codebook Cross-Attention (MCCA) mechanism enables simultaneous capture of fine-grained multimodal interests and cross-modal correlations. Extensive experiments on multiple datasets demonstrated significant improvements, validating the state-of-the-art performance of our framework. The successful deployment on a leading music streaming platform underscores its practical value in real-world scenarios. This study advances the field by bridging semantic fidelity and multimodal synergy, offering a scalable solution for industrial recommendation systems.

7. Acknowledgments

This research was supported by the Natural Science Foundation of Zhejiang Province under Grant No. LZ25F020010.

Bibliography

Babenko and Lempitsky (2014) Artem Babenko and Victor Lempitsky. 2014. Additive Quantization for Extreme Vector Compression. In 2014 IEEE Conference on Computer Vision and Pattern Recognition. 931–938. doi: 10.1109/CVPR.2014.124
Chen et al. (2022) Feiyu Chen, Junjie Wang, Yinwei Wei, Hai-Tao Zheng, and Jie Shao. 2022. Breaking isolation: Multimodal graph fusion for multimedia recommendation by edge-wise modulation. In Proceedings of the 30th ACM International Conference on Multimedia. 385–394.
Chen et al. (2024) Gaode Chen, Ruina Sun, Yuezihan Jiang, Jiangxia Cao, Qi Zhang, Jingjian Lin, Han Li, Kun Gai, and Xinghua Zhang. 2024. A Multi-modal Modeling Framework for Cold-start Short-video Recommendation. In Proceedings of the 18th ACM Conference on Recommender Systems (Bari, Italy) (RecSys ’24). Association for Computing Machinery, New York, NY, USA, 391–400. doi: 10.1145/3640457.3688098
Deng et al. (2025) Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, and Guorui Zhou. 2025. OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment. arXiv:2502.18965 [cs.IR] https://arxiv.org/abs/2502.18965
Ferdowsi et al. (2017) Sohrab Ferdowsi, Slava Voloshynovskiy, and Dimche Kostadinov. 2017. Regularized Residual Quantization: a multi-layer sparse dictionary learning approach. arXiv:1705.00522 [cs.LG] https://arxiv.org/abs/1705.00522
Fiore (2016) Jennifer Fiore. 2016. Analysis of Lyrics from Group Songwriting with Bereaved Children and Adolescents. Journal of Music Therapy 53, 3 (05 2016), 207–231. https://academic.oup.com/jmt/article-pdf/53/3/207/7953134/thw005.pdf doi: 10.1093/jmt/thw005
Girdhar et al. (2023) Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. ImageBind: One Embedding Space to Bind Them All. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 15180–15190. doi: 10.1109/CVPR52729.2023.01457