MULTIMODAL ANALYSIS: Informed content estimation and audio source separation
Gabriel Meseguer-Brocal

TL;DR
This dissertation explores multimodal learning that combines audio signals and lyrics to improve music source separation and content estimation, emphasizing the unique connection that the singing voice creates between melody and lyrics.
Contribution
It introduces a novel approach focusing on the interaction between audio and lyrics for enhanced source separation and content estimation in musical signals.
Findings
Improved source separation accuracy using lyrics information
Enhanced content estimation through multimodal analysis
Demonstrated the effectiveness of combining audio and text data
Abstract
This dissertation proposes the study of multimodal learning in the context of musical signals. Throughout, we focus on the interaction between audio signals and text information. Among the many text sources related to music that can be used (e.g., reviews, metadata, or social network feedback), we concentrate on lyrics. The singing voice directly connects the audio signal and the text information in a unique way, combining melody and lyrics where a linguistic dimension complements the abstraction of musical instruments. Our study focuses on the interaction between audio and lyrics, targeting source separation and informed content estimation.
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
