MTCRNN: A multi-scale RNN for directed audio texture synthesis
M. Huzaifah, L. Wyse

TL;DR
This paper introduces MTCRNN, a multi-scale RNN model that captures complex audio textures across multiple timescales, enabling user-directed synthesis of environmental sounds like rain and wind.
Contribution
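The paper presents a novel multi-scale RNN architecture, in which recurrent networks are trained at different levels of abstraction, together with a conditioning strategy that enables user-directed audio texture synthesis. It targets sounds whose patterns span multiple timescales, which traditional methods struggle to model.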
Findings
Effective modeling of diverse environmental sounds
Enhanced synthesis quality with user control
Good performance on multiple datasets
Abstract
Audio textures are a subset of environmental sounds, often defined as having stable statistical characteristics within an adequately large window of time while possibly being unstructured locally. They include common everyday sounds such as rain, wind, and engines. Because these complex sounds contain patterns on multiple timescales, they are a challenge to model with traditional methods. We introduce a novel modelling approach for textures, combining recurrent neural networks trained at different levels of abstraction with a conditioning strategy that allows for user-directed synthesis. We demonstrate the model's performance on a variety of datasets, examine its performance on various metrics, and discuss some potential applications.
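To make the multi-timescale idea in the abstract more concrete, here is a minimal sketch of one plausible reading of it: a coarse recurrent tier runs at a slow rate and takes a user control value, and its output conditions a fine tier that emits several samples per coarse step. This is not the authors' implementation. The function names (`rnn_step`, `synthesize`), the two-tier layout, the `ratio` parameter, and the layer sizes are all hypothetical, and the weights are random and untrained, so the output only illustrates the data flow, not a realistic texture.

```python
import numpy as np


def rnn_step(x, h, W_x, W_h, b):
    """One step of a plain (Elman-style) tanh RNN cell."""
    return np.tanh(W_x @ x + W_h @ h + b)


def synthesize(control, ratio=4, hid=16, seed=0):
    """Generate len(control) * ratio samples from a two-tier RNN.

    control: 1-D array, one user control value per coarse step (assumed).
    ratio:   fine-tier steps per coarse-tier step (assumed).
    All weights are random and untrained; this is illustrative only.
    """
    rng = np.random.default_rng(seed)

    # Coarse tier: input is [previous coarse feature, user control value].
    Wc_x = rng.normal(0.0, 0.3, (hid, 2))
    Wc_h = rng.normal(0.0, 0.3, (hid, hid))
    bc = np.zeros(hid)
    Wc_o = rng.normal(0.0, 0.3, (1, hid))

    # Fine tier: input is [previous sample, current coarse feature].
    Wf_x = rng.normal(0.0, 0.3, (hid, 2))
    Wf_h = rng.normal(0.0, 0.3, (hid, hid))
    bf = np.zeros(hid)
    Wf_o = rng.normal(0.0, 0.3, (1, hid))

    h_c = np.zeros(hid)
    h_f = np.zeros(hid)
    feat = 0.0    # slow-moving feature produced by the coarse tier
    sample = 0.0  # last emitted output sample
    out = []

    for c in control:
        # The coarse tier updates once per control value.
        h_c = rnn_step(np.array([feat, c]), h_c, Wc_x, Wc_h, bc)
        feat = float(np.tanh(Wc_o @ h_c)[0])

        # The fine tier runs `ratio` steps, conditioned on the coarse feature.
        for _ in range(ratio):
            h_f = rnn_step(np.array([sample, feat]), h_f, Wf_x, Wf_h, bf)
            sample = float(np.tanh(Wf_o @ h_f)[0])
            out.append(sample)

    return np.array(out)


if __name__ == "__main__":
    # Sweep the control value from 0 to 1 over five coarse steps.
    audio = synthesize(np.linspace(0.0, 1.0, 5), ratio=4)
    print(audio.shape)
```

In a trained system each tier would be fitted to data at its own timescale; here the point is only that the user-facing control enters at the slow tier and reaches the sample level through conditioning.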
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
