Adapting a Text-to-Audio Model for Room Impulse Response Generation
Kirak Kim, Sungyoung Kim

TL;DR
This paper introduces a novel method for generating Room Impulse Responses by adapting a pre-trained text-to-audio model, utilizing vision-language models for data labeling and in-context learning for flexible user prompts.
Contribution
The paper demonstrates, for the first time according to the authors, that large-scale generative audio priors can be adapted for RIR generation, using a vision-language-model labeling pipeline and an in-context learning strategy.
Findings
Generated RIRs are perceptually plausible according to listening tests.
The approach leverages vision-language models to create labeled data from image-RIR datasets.
The method enables flexible user prompts through in-context learning.
Abstract
Room Impulse Responses (RIRs) enable realistic acoustic simulation, with applications ranging from multimedia production to speech data augmentation. However, acquiring high-quality real-world RIRs is labor-intensive, and data scarcity remains a challenge for data-driven RIR generation approaches. In this paper, we propose a novel approach to RIR generation by adapting a pre-trained text-to-audio model, demonstrating for the first time that large-scale generative audio priors can be effectively leveraged for the task. To address the lack of text-RIR paired data, we utilize a labeling pipeline that leverages vision-language models to extract acoustic descriptions from existing image-RIR datasets. We introduce an in-context learning strategy to accommodate free-form user prompts during inference. Evaluations, including a subjective listening test, demonstrate that our model generates plausible…
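For readers unfamiliar with how an RIR is used downstream, the following is a minimal sketch of the speech-augmentation use case mentioned in the abstract: a dry signal is convolved with an RIR to simulate how it would sound in a room. The function names `synthetic_rir` and `apply_rir` are hypothetical, and the RIR here is a toy stand-in (exponentially decaying noise), not an output of the paper's model.

```python
import numpy as np


def synthetic_rir(sample_rate=16000, rt60=0.4, seed=0):
    """Toy RIR: decaying noise. A stand-in only, NOT the paper's method."""
    rng = np.random.default_rng(seed)
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    # Amplitude envelope that falls by 60 dB over rt60 seconds.
    envelope = 10.0 ** (-3.0 * t / rt60)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct-path impulse at time zero
    return rir / np.max(np.abs(rir))


def apply_rir(dry, rir):
    """Convolve a dry signal with an RIR and peak-normalize the result."""
    wet = np.convolve(dry, rir, mode="full")
    peak = np.max(np.abs(wet))
    return wet / peak if peak > 0 else wet


if __name__ == "__main__":
    sr = 16000
    # Hypothetical dry input: one second of a 440 Hz tone.
    dry = np.sin(2 * np.pi * 440.0 * np.arange(sr) / sr)
    wet = apply_rir(dry, synthetic_rir(sample_rate=sr))
    print(len(dry), len(wet))
```

In practice the synthetic RIR would be replaced by a measured or generated one; the convolution step stays the same. The output is longer than the input by the RIR length minus one sample, which corresponds to the reverberant tail.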
