Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers
Jackie Lin, Jiaqi Su, Nishit Anand, Zeyu Jin, Minje Kim, Paris Smaragdis

TL;DR
Gencho is a novel diffusion-transformer model that generates diverse, realistic room impulse responses from reverberant speech and text, improving flexibility and performance in acoustic simulation tasks.
Contribution
Gencho introduces a structure-aware encoder and a diffusion decoder for generating RIRs as complex spectrograms, enabling controllable and high-quality acoustic modeling.
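As a rough illustration of what a "complex spectrogram" of an RIR can look like as a model target, the sketch below computes a short-time Fourier transform with NumPy and stacks the real and imaginary parts as two channels. The function names, the FFT size, the hop length, and the two-channel layout are assumptions made for this example; the summary does not state the settings or the layout Gencho uses.

```python
import numpy as np

def complex_spectrogram(signal, n_fft=256, hop=64):
    """Return a complex STFT of shape (n_fft // 2 + 1, n_frames).

    n_fft and hop are illustrative defaults, not values from the paper.
    """
    window = np.hanning(n_fft)
    if len(signal) < n_fft:
        pad = n_fft - len(signal)
    else:
        # Zero-pad so the last frame is complete.
        pad = (-(len(signal) - n_fft)) % hop
    padded = np.pad(signal, (0, pad))
    n_frames = 1 + (len(padded) - n_fft) // hop
    frames = np.stack(
        [padded[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1).T

def to_channels(spec):
    """Stack real and imaginary parts: (2, freq_bins, n_frames), real-valued."""
    return np.stack([spec.real, spec.imag])

def from_channels(channels):
    """Inverse of to_channels: rebuild the complex spectrogram."""
    return channels[0] + 1j * channels[1]
```

A real-valued two-channel array like this is one common way to hand complex spectra to a network that only operates on real tensors; a magnitude and phase pair would be another option.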
Findings
Richer RIRs than non-generative baselines
Strong performance on standard RIR metrics
Effective text-conditioned RIR generation
Abstract
Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties; yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages isolation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong…
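Two ideas in the abstract lend themselves to a small sketch: the split between early and late reflections, and acoustic matching, which in its simplest form means convolving dry speech with an RIR. The NumPy code below illustrates both under stated assumptions. The 50 ms early/late boundary is a common rule of thumb rather than a value from the paper, `synthetic_rir` is a hypothetical stand-in (exponentially decaying noise) for a generated RIR, and none of the function names come from Gencho.

```python
import numpy as np

def split_early_late(rir, sample_rate, early_ms=50.0):
    """Split an RIR into early and late parts after the direct-path peak.

    early_ms is an assumed boundary, not a value taken from the paper.
    """
    peak = int(np.argmax(np.abs(rir)))
    boundary = min(len(rir), peak + int(round(early_ms * 1e-3 * sample_rate)))
    early = np.zeros_like(rir)
    late = np.zeros_like(rir)
    early[:boundary] = rir[:boundary]
    late[boundary:] = rir[boundary:]
    return early, late

def apply_rir(dry, rir):
    """Convolve dry speech with an RIR, truncated to the dry signal's length."""
    return np.convolve(dry, rir)[: len(dry)]

def synthetic_rir(sample_rate=16000, rt60=0.4, length_s=0.5, seed=0):
    """Toy RIR: unit direct path followed by exponentially decaying noise."""
    rng = np.random.default_rng(seed)
    n = int(length_s * sample_rate)
    t = np.arange(n) / sample_rate
    # Amplitude falls by 60 dB at t = rt60, since 3 * ln(10) is about 6.9078.
    decay = np.exp(-3.0 * np.log(10.0) * t / rt60)
    rir = 0.1 * rng.standard_normal(n) * decay
    rir[0] = 1.0
    return rir
```

Because convolution is linear, applying the early and late parts separately and summing the results gives the same signal as applying the full RIR, which is what makes the two parts separable for analysis or encoding.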
Taxonomy
Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis
