Remixing Music with Visual Conditioning
Li-Chia Yang, Alexander Lerch

TL;DR
This paper introduces a music remixing system that conditions audio source separation and remixing on user-selected images, and reports better audio quality than the separate-and-add baseline.
Contribution
It adapts an audio-visual source separation model to work with images instead of videos and develops a remixing engine that improves audio quality in music remixing tasks.
Findings
Achieves better audio quality than separate-and-add methods
Successfully uses images as visual conditioning for audio source separation
Extends audio-visual models to user-selected images for remixing
Abstract
We propose a visually conditioned music remixing system by incorporating deep visual and audio models. The method is based on a state-of-the-art audio-visual source separation model which performs musical instrument source separation with video information. We modified the model to work with user-selected images instead of videos as visual input during inference to enable separation of audio-only content. Furthermore, we propose a remixing engine that generalizes the task of source separation into music remixing. The proposed method is able to achieve improved audio quality compared to remixing performed by the separate-and-add method with a state-of-the-art audio-visual source separation model.
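To make the separate-and-add baseline named in the abstract concrete, the sketch below shows one common way such a remix can be computed: each source is isolated by applying a time-frequency mask to the mixture spectrogram, scaled by a user-chosen gain, and the scaled sources are summed. This is an illustration under stated assumptions, not the authors' implementation. The function name `separate_and_add`, the mask-based formulation, the array shapes, and the per-source gains are all hypothetical; in practice the masks would come from a separation model, which is not shown here.

```python
import numpy as np


def separate_and_add(mixture_spec, masks, gains):
    """Remix by masking the mixture per source, scaling, and summing.

    Hypothetical helper for illustration only.

    mixture_spec: complex array of shape (freq, time), the mixture STFT.
    masks: float array of shape (n_sources, freq, time), one mask per source.
    gains: float array of shape (n_sources,), linear gain applied to each source.
    Returns a complex array of shape (freq, time), the remixed spectrogram.
    """
    mixture_spec = np.asarray(mixture_spec)
    masks = np.asarray(masks, dtype=float)
    gains = np.asarray(gains, dtype=float)

    if masks.shape[1:] != mixture_spec.shape:
        raise ValueError("each mask must match the mixture spectrogram shape")
    if gains.shape != (masks.shape[0],):
        raise ValueError("one gain is required per source")

    # Separate: estimate each source by masking the mixture.
    sources = masks * mixture_spec[None, :, :]
    # Add: weight each estimated source by its gain and sum over sources.
    return np.tensordot(gains, sources, axes=1)
```

With unit gains and masks that sum to one in every time-frequency bin, this returns the original mixture. Any separation error in the masks is carried directly into the remix, which is the kind of degradation a dedicated remixing engine would aim to reduce.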
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Blind Source Separation Techniques
