Attention Is All You Need For Blind Room Volume Estimation
Chunxi Wang, Maoshen Jia, Meiran Li, Changchun Bao, Wenyu Jin

TL;DR
This paper introduces a novel attention-based Transformer model for blind room volume estimation from noisy speech, outperforming CNN-based methods by leveraging self-attention, transfer learning, and data augmentation.
Contribution
It presents the first purely attention-based approach for blind room volume estimation, eliminating the need for CNNs and demonstrating improved accuracy in real-world conditions.
Findings
The Transformer model estimates room volume more accurately than CNN-based models.
Transfer learning and data augmentation enhance model performance.
The approach is effective across diverse acoustic environments.
Abstract
In recent years, dynamic parameterization of acoustic environments has attracted increasing attention in the field of audio processing. One of the key parameters that characterize the local room acoustics in isolation from orientation and directivity of sources and receivers is the geometric room volume. Convolutional neural networks (CNNs) have been widely used as the main models for blind room acoustic parameter estimation, which aims to learn a direct mapping from audio spectrograms to corresponding labels. Following the recent trend of self-attention mechanisms, this paper introduces a purely attention-based model to blindly estimate room volumes from single-channel noisy speech signals. We demonstrate the feasibility of eliminating the reliance on CNNs for this task. The proposed Transformer architecture takes Gammatone magnitude spectral coefficients and phase…
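To make the idea of a purely attention-based estimator concrete, here is a minimal sketch, not the authors' model. It assumes a sequence of patch embeddings (random numbers standing in for embedded Gammatone features), a single attention head with untrained random weights, mean pooling, and a linear readout to one scalar treated as a log-volume. All names (`self_attention`, `estimate_log_volume`, `params`), the dimensions, and the choice of a log-volume target are illustrative assumptions, so the output value carries no meaning.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (patches, dim) array."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights

def estimate_log_volume(patches, params):
    """Attend over patches, mean-pool, and map to one scalar (hypothetical head)."""
    attended, _ = self_attention(
        patches, params["w_q"], params["w_k"], params["w_v"]
    )
    pooled = attended.mean(axis=0)
    return float(pooled @ params["w_out"] + params["b_out"])

# Assumed toy sizes: 12 spectrogram patches, 16-dimensional embeddings.
rng = np.random.default_rng(0)
num_patches, dim = 12, 16
patches = rng.standard_normal((num_patches, dim))

# Untrained random parameters, for shape illustration only.
params = {
    "w_q": rng.standard_normal((dim, dim)) / np.sqrt(dim),
    "w_k": rng.standard_normal((dim, dim)) / np.sqrt(dim),
    "w_v": rng.standard_normal((dim, dim)) / np.sqrt(dim),
    "w_out": rng.standard_normal(dim) / np.sqrt(dim),
    "b_out": 0.0,
}

attended, weights = self_attention(
    patches, params["w_q"], params["w_k"], params["w_v"]
)
log_volume = estimate_log_volume(patches, params)
```

Every patch attends to every other patch in one step, which is the property that lets such a model drop convolutional layers; a practical version would stack several multi-head layers and train the weights on labelled room recordings.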
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Advanced Adaptive Filtering Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections
