Gencho: Room Impulse Response Generation from Reverberant Speech and Text via Diffusion Transformers

Jackie Lin; Jiaqi Su; Nishit Anand; Zeyu Jin; Minje Kim; Paris Smaragdis

arXiv:2602.09233·cs.SD·February 11, 2026

Open Access

TL;DR

Gencho is a diffusion-transformer model that generates diverse, realistic room impulse responses (RIRs) from reverberant speech and text, aimed at acoustic matching and generative audio applications.

Contribution

The paper introduces a structure-aware encoder and a diffusion decoder that generate RIRs as complex spectrograms, enabling controllable, high-quality acoustic modeling.

Findings

01 Richer RIRs than non-generative baselines

02 Strong performance on standard RIR metrics

03 Effective text-conditioned RIR generation

Abstract

Blind room impulse response (RIR) estimation is a core task for capturing and transferring acoustic properties, yet existing methods often suffer from limited modeling capability and degraded performance under unseen conditions. Moreover, emerging generative audio applications call for more flexible impulse response generation methods. We propose Gencho, a diffusion-transformer-based model that predicts complex spectrogram RIRs from reverberant speech. A structure-aware encoder leverages the separation between early and late reflections to encode the input audio into a robust representation for conditioning, while the diffusion decoder generates diverse and perceptually realistic impulse responses from it. Gencho integrates modularly with standard speech processing pipelines for acoustic matching. Results show richer generated RIRs than non-generative baselines while maintaining strong…
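To make two ideas from the abstract concrete (the split between early and late reflections, and the use of an RIR for acoustic matching), here is a minimal, standard-library Python sketch. It is not Gencho's encoder or decoder. The RIR is a toy signal (a direct-path impulse plus exponentially decaying noise), the 50 ms early/late boundary is an illustrative assumption that the abstract does not specify, and the helper names `synthetic_rir`, `split_early_late`, and `convolve` are hypothetical.

```python
import random


def synthetic_rir(sample_rate=8000, duration_s=0.25, rt60_s=0.2, seed=0):
    """Toy RIR (assumption, not a model output): impulse plus decaying noise."""
    rng = random.Random(seed)
    n = int(sample_rate * duration_s)
    # Envelope falls by 60 dB (factor 10^-3 in amplitude) after rt60_s seconds.
    rir = [
        0.3 * rng.gauss(0.0, 1.0) * 10.0 ** (-3.0 * (i / sample_rate) / rt60_s)
        for i in range(n)
    ]
    rir[0] = 1.0  # direct path
    return rir


def split_early_late(rir, sample_rate=8000, boundary_ms=50.0):
    """Split an RIR at an assumed boundary; both parts keep the full length."""
    k = min(len(rir), int(sample_rate * boundary_ms / 1000.0))
    early = rir[:k] + [0.0] * (len(rir) - k)
    late = [0.0] * k + rir[k:]
    return early, late


def convolve(x, h):
    """Direct-form linear convolution of two finite sequences."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y


rir = synthetic_rir()
early, late = split_early_late(rir)

# Stand-in for a dry (anechoic) speech signal: a few sparse samples.
dry = [1.0, 0.0, 0.5, 0.0, -0.25]

# Acoustic matching in its simplest form: convolve dry audio with the RIR.
wet = convolve(dry, rir)

# Convolution is linear, so the early and late contributions add up to wet.
wet_early = convolve(dry, early)
wet_late = convolve(dry, late)
```

Because the early and late parts sum exactly to the original RIR, the reverberant signal decomposes into an early-reflection component and a late-reverberation component. In a real pipeline an FFT-based convolution would replace the quadratic loop used here.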

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Speech Recognition and Synthesis