TL;DR
DGSNA introduces a prompt-driven, generative approach to create diverse, scene-specific noise for speech systems, improving robustness without relying on pre-existing noise libraries.
Contribution
It combines generative language models with diffusion-based audio synthesis to dynamically generate realistic scene-based noise for speech data augmentation.
Findings
Achieves up to 11.32% relative improvement in speech recognition robustness.
Effectively simulates diverse acoustic environments without pre-existing noise datasets.
Highly compatible with existing noise addition techniques.
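For context on the "existing noise addition techniques" mentioned above, the following is a minimal sketch of conventional noise addition at a target signal-to-noise ratio. It is not taken from the paper. The function names `mean_power` and `add_noise_at_snr` are illustrative assumptions, and the waveforms are plain Python lists of float samples. A generated noise clip could in principle be passed in the same way as a library clip.

```python
import math


def mean_power(signal):
    """Average power (mean of squared samples) of a waveform."""
    return sum(x * x for x in signal) / len(signal)


def add_noise_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so the mixture has the requested SNR in dB.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise clip so that it covers the whole utterance.
    tiled = [noise[i % len(noise)] for i in range(len(speech))]
    noise_power = mean_power(tiled)
    if noise_power == 0.0:
        # Silent noise: nothing to add.
        return list(speech)
    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, tiled)]
```

Lower `snr_db` values give a noisier mixture; augmentation pipelines commonly sample the SNR from a range rather than fixing it.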
Abstract
To ensure the reliable operation of speech systems across diverse environments, noise addition methods have emerged as the standard solution. However, existing methods offer limited coverage of real-world scenes and depend on pre-existing noise libraries and scene metadata. This paper presents prompt-based Dynamic Generative Scene-based Noise Addition (DGSNA), a novel approach driven by generative language models that integrates Dynamic Generation of Scene-based Information (DGSI) with Scene-based Noise Addition for Speech (SNAS). The DGSI module, with a BET (Background, Examples, Task) prompt framework, dynamically generates logic-compliant scene-based information, including scene dimensions, sound sources, and microphone positions, thereby addressing the challenges of scene enumeration and detailed description. Complementing this, the SNAS module employs a Time-Frequency Diffusion-based…
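As a rough illustration of the BET idea described in the abstract, the sketch below assembles a prompt from Background, Examples, and Task sections and checks that a generated scene is geometrically consistent. The abstract gives neither the actual prompt wording nor the scene schema, so everything here is an assumption: the function names, the field names (`dimensions_m`, `microphone_position_m`, `sound_sources`, `position_m`), and the reading of "logic-compliant" as "all positions lie inside the room".

```python
def build_bet_prompt(background, examples, task):
    """Join Background, Examples, and Task sections into one prompt string.

    Hypothetical layout; the paper's real prompt template is not shown here.
    """
    sections = [
        "Background:\n" + background.strip(),
        "Examples:\n" + "\n".join("- " + e.strip() for e in examples),
        "Task:\n" + task.strip(),
    ]
    return "\n\n".join(sections)


def is_inside_room(position, dimensions):
    """True if every coordinate lies between 0 and the room size on that axis."""
    return len(position) == len(dimensions) and all(
        0.0 <= p <= d for p, d in zip(position, dimensions)
    )


def scene_is_consistent(scene):
    """Check an assumed scene dict: positive 3-D room, all positions inside it."""
    dims = scene["dimensions_m"]
    if len(dims) != 3 or any(d <= 0 for d in dims):
        return False
    points = [scene["microphone_position_m"]]
    points += [src["position_m"] for src in scene["sound_sources"]]
    return all(is_inside_room(p, dims) for p in points)
```

A check of this kind could be run on a language model's output before the scene description is handed to an audio synthesis stage, so that inconsistent scenes are regenerated rather than rendered.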
