Zero-shot Cross-domain Knowledge Distillation: A Case study on YouTube Music
Srivaths Ranganathan, Nikhil Khani, Shawn Andrews, Chieh Lo, Li Wei, Gergo Varady, Jochen Klingenhoefer, Tim Steele, Bernardo Cunha, Aniruddh Nath, Yanwei Song

TL;DR
This paper demonstrates that zero-shot cross-domain knowledge distillation effectively enhances low-traffic music recommendation models by leveraging large-scale video recommendation data.
Contribution
It presents a case study applying zero-shot cross-domain KD to transfer knowledge from YouTube to a music app, addressing challenges in low-data environments.
Findings
Zero-shot cross-domain KD improves ranking model performance on low-traffic surfaces.
Different KD techniques were evaluated across two music ranking models.
Offline and live experiments confirm the practicality of the approach.
Abstract
Knowledge Distillation (KD) has been widely used to improve the quality of latency-sensitive models serving live traffic. However, applying KD in production recommender systems with low traffic is challenging: the limited amount of data restricts the teacher model size, and the cost of training a large dedicated teacher may not be justified. Cross-domain KD offers a cost-effective alternative by leveraging a teacher from a data-rich source domain, but introduces unique technical difficulties, as the features, user interfaces, and prediction tasks can significantly differ. We present a case study of using zero-shot cross-domain KD for multi-task ranking models, transferring knowledge from a (100x) large-scale video recommendation platform (YouTube) to a music recommendation application with significantly lower traffic. We share offline and live experiment results and present findings…
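To make the general idea concrete, here is a minimal sketch of a distillation loss for a single binary ranking task (for example, predicting a click). It is not the authors' implementation: the abstract does not specify the loss, so the blending weight `alpha`, the function names, and the use of binary cross-entropy on teacher soft labels are all assumptions chosen for illustration. In a zero-shot cross-domain setup, `teacher_prob` would come from a teacher trained only on the source domain and scored on target-domain examples without fine-tuning.

```python
import math


def sigmoid(x: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def binary_cross_entropy(prob: float, target: float, eps: float = 1e-7) -> float:
    """Cross-entropy between a predicted probability and a target in [0, 1]."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(prob) + (1.0 - target) * math.log(1.0 - prob))


def distillation_loss(
    student_logit: float,
    hard_label: float,
    teacher_prob: float,
    alpha: float = 0.5,  # hypothetical blending weight, not from the paper
) -> float:
    """Blend the loss on the observed label with the loss on the teacher's soft label."""
    student_prob = sigmoid(student_logit)
    hard_loss = binary_cross_entropy(student_prob, hard_label)
    soft_loss = binary_cross_entropy(student_prob, teacher_prob)
    return (1.0 - alpha) * hard_loss + alpha * soft_loss


# Example: the student is rewarded for agreeing with both the label and the teacher.
loss_agree = distillation_loss(student_logit=2.0, hard_label=1.0, teacher_prob=0.9)
loss_disagree = distillation_loss(student_logit=-2.0, hard_label=1.0, teacher_prob=0.9)
```

For a multi-task ranking model, one would typically compute a term like this per task and sum them, but how the tasks are matched between the source and target domains is exactly the kind of detail the paper investigates and this sketch leaves open.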
