TALKPLAY: Multimodal Music Recommendation with Large Language Models
Seungheon Doh, Keunwoo Choi, Juhan Nam

TL;DR
TALKPLAY is a multimodal music recommendation system built on large language models. It encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals as tokens, so a single model can make end-to-end conversational recommendations and respond in natural language.
Contribution
The paper presents a novel multimodal music tokenizer and vocabulary expansion for LLMs, unifying recommendation and dialogue into a single end-to-end system.
Findings
Outperforms unimodal approaches in recommendation accuracy.
Effectively handles long conversational contexts.
Generates natural language responses for user interaction.
Abstract
We present TALKPLAY, a novel multimodal music recommendation system that reformulates recommendation as a token generation problem using large language models (LLMs). By leveraging the instruction-following and natural language generation capabilities of LLMs, our system effectively recommends music from diverse user queries while generating contextually relevant responses. While pretrained LLMs are primarily designed for text modality, TALKPLAY extends their scope through two key innovations: a multimodal music tokenizer that encodes audio features, lyrics, metadata, semantic tags, and playlist co-occurrence signals; and a vocabulary expansion mechanism that enables unified processing and generation of both linguistic and music-relevant tokens. By integrating the recommendation system directly into the LLM architecture, TALKPLAY transforms conventional systems by: (1) unifying previous…
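The tokenizer and vocabulary expansion described in the abstract can be illustrated with a minimal sketch. The following is not the authors' implementation: it assumes each modality of a track has already been quantized to an integer code (for example by clustering its features), and the names `MODALITIES`, `expand_vocab`, `track_to_tokens`, and `encode` are hypothetical, as is the `<modality_code>` token format.

```python
# Illustrative sketch only; names and token format are assumptions, not from the paper.
MODALITIES = ("audio", "lyrics", "metadata", "tag", "playlist")


def expand_vocab(text_vocab, codebook_size):
    """Return a copy of a text vocabulary extended with music tokens.

    One token is added per (modality, code) pair; new ids continue after
    the existing text token ids.
    """
    vocab = dict(text_vocab)
    for modality in MODALITIES:
        for code in range(codebook_size):
            vocab[f"<{modality}_{code}>"] = len(vocab)
    return vocab


def track_to_tokens(codes):
    """Represent one track as a fixed-order sequence of modality tokens.

    `codes` maps each modality name to its (assumed, precomputed) integer code.
    """
    return [f"<{modality}_{codes[modality]}>" for modality in MODALITIES]


def encode(tokens, vocab):
    """Map tokens (text or music) to ids in the shared vocabulary."""
    return [vocab[token] for token in tokens]


text_vocab = {"play": 0, "something": 1}
vocab = expand_vocab(text_vocab, codebook_size=4)
track = {"audio": 1, "lyrics": 0, "metadata": 3, "tag": 2, "playlist": 1}
track_tokens = track_to_tokens(track)
mixed_ids = encode(["play", "something"] + track_tokens, vocab)
```

Because text and music tokens share one id space, a language model with a correspondingly resized embedding table could, in principle, emit a track as a short token sequence in the same output stream as its natural language reply, which is the sense in which recommendation becomes token generation.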
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Music History and Culture
