SonicRAG: High Fidelity Sound Effects Synthesis Based on Retrieval Augmented Generation
Yu-Ren Guo, Wen-Kai Tai

TL;DR
SonicRAG introduces a retrieval-augmented framework combining large language models with sound effect databases to improve the diversity and quality of high-fidelity sound effects synthesis without extra recording costs.
Contribution
The paper presents a novel retrieval-augmented sound effects synthesis framework that enhances audio quality and diversity by integrating LLMs with sound effect databases.
Findings
Improved sound effect diversity and quality.
Elimination of additional recording costs.
Flexible and efficient sound design process.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language processing (NLP) and multimodal learning, with successful applications in text generation and speech synthesis, enabling a deeper understanding and generation of multimodal content. In the field of sound effects (SFX) generation, LLMs have been leveraged to orchestrate multiple models for audio synthesis. However, due to the scarcity of annotated datasets and the complexity of temporal modeling, current SFX generation techniques still fall short of achieving high-fidelity audio. To address these limitations, this paper introduces a novel framework that integrates LLMs with existing sound effect databases, allowing for the retrieval, recombination, and synthesis of audio based on user requirements. By leveraging this approach, we enhance the diversity and quality of generated sound effects while…
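The retrieve-then-recombine idea described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the in-memory `SFX_DB`, the bag-of-words cosine scoring, and the helper names `retrieve`, `mix`, and `synthesize` are all assumptions made for illustration, and the LLM step that would interpret user requirements is omitted entirely.

```python
import math
from collections import Counter

# Hypothetical in-memory database: text description -> mono samples in [-1, 1].
# A real system would index audio files; these short lists are placeholders.
SFX_DB = {
    "heavy rain on roof": [0.2, -0.1, 0.3, -0.2],
    "distant thunder rumble": [0.5, 0.4, -0.3],
    "footsteps on gravel": [0.1, -0.4, 0.2, 0.0, 0.1],
}

def _vec(text):
    """Bag-of-words term counts (assumed stand-in for a learned embedding)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Return up to k descriptions with non-zero similarity, best first."""
    q = _vec(query)
    scored = [(cosine(q, _vec(desc)), desc) for desc in SFX_DB]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [desc for score, desc in scored[:k] if score > 0]

def mix(clips, gain=0.5):
    """Overlay clips sample by sample, scale by gain, and clip to [-1, 1]."""
    length = max((len(c) for c in clips), default=0)
    out = [0.0] * length
    for clip in clips:
        for i, sample in enumerate(clip):
            out[i] += gain * sample
    return [max(-1.0, min(1.0, s)) for s in out]

def synthesize(query, k=2):
    """Retrieve matching clips and recombine them into one layered clip."""
    names = retrieve(query, k)
    return names, mix([SFX_DB[n] for n in names])
```

Calling `synthesize("rain and thunder")` would select the thunder and rain entries and layer them into a single clip. Because the output is built from existing recordings rather than generated from scratch, it inherits their fidelity, which is the motivation the abstract gives for combining retrieval with synthesis.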
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music Technology and Sound Studies
