Listen, Chat, and Remix: Text-Guided Soundscape Remixing for Enhanced Auditory Experience
Xilin Jiang, Cong Han, Yinghao Aaron Li, and Nima Mesgarani

TL;DR
This paper presents 'Listen, Chat, and Remix' (LCR), a system that remixes a sound mixture from text prompts: a large language model interprets the prompt, and each sound source is then controlled individually without first having to be separated from the mixture.
Contribution
LCR introduces a novel multimodal sound remixing method that interprets text instructions to control multiple sound sources simultaneously within a mixture.
Findings
Significant signal quality improvements across remixing tasks
Robust zero-shot performance with diverse sound sources
Effective semantic filtering based on user prompts
Abstract
In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Remix" (LCR), a novel multimodal sound remixer that controls each sound source in a mixture based on user-provided text instructions. LCR distinguishes itself with a user-friendly text interface and its unique ability to remix multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for remixing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles filtered components back to the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along…
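The decompose, filter, and reassemble steps described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the components are already available as waveforms and that the semantic filter reduces to one gain per component, whereas LCR derives its filter from a large language model and does not separate the sources explicitly. The names `db_to_gain` and `remix`, and the example prompt, are hypothetical.

```python
import numpy as np


def db_to_gain(db):
    """Convert a level change in decibels to a linear amplitude gain."""
    return 10.0 ** (db / 20.0)


def remix(components, gains):
    """Scale each component by its gain and sum them into one waveform.

    components: array of shape (K, T), one row per component (assumed given).
    gains:      array of shape (K,), a stand-in for the semantic filter.
    """
    components = np.asarray(components, dtype=float)
    gains = np.asarray(gains, dtype=float)
    if components.ndim != 2 or gains.shape != (components.shape[0],):
        raise ValueError("expected components (K, T) and gains (K,)")
    return gains @ components


# Toy mixture: two sinusoids standing in for "speech" and "noise".
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
speech = np.sin(2 * np.pi * 220 * t)
noise = 0.5 * np.sin(2 * np.pi * 1000 * t)
components = np.stack([speech, noise])
mixture = components.sum(axis=0)

# Hypothetical prompt: "make the speech 6 dB louder and remove the noise".
# In LCR an LLM interprets the prompt; here the gains are written by hand.
gains = np.array([db_to_gain(6.0), 0.0])
output = remix(components, gains)
```

With all gains set to 1 the sketch returns the original mixture, which is a quick check that the filter only rescales what is already present.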
Taxonomy
Topics: Noise Effects and Management
