Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction

Amartyaveer; Murali Kadambi; Chandra Mohan Sharma; Anupam Mondal; Prasanta Kumar Ghosh

arXiv:2602.15484 · eess.AS · March 11, 2026

Open Access

TL;DR

This paper introduces a bottleneck transformer model that improves the prediction of the STOI speech intelligibility metric without needing clean reference speech, outperforming existing methods in accuracy.

Contribution

The study presents a novel bottleneck transformer architecture with convolutional and self-attention components for nonintrusive STOI score prediction, enhancing performance over prior models.

Findings

01 Higher correlation with true STOI scores

02 Lower mean squared error in predictions

03 Effective on both seen and unseen data scenarios
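The two evaluation measures named in these findings can be sketched as follows. This is a minimal illustration, assuming the correlation is the Pearson linear correlation coefficient (the summary does not say which correlation measure is used); the function names and the sample scores are hypothetical and are not results from the paper.

```python
import math


def pearson_correlation(predicted, target):
    """Pearson linear correlation between predicted and true scores."""
    n = len(predicted)
    if n != len(target) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_t)


def mean_squared_error(predicted, target):
    """Average squared difference between predicted and true scores."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("need two non-empty, equal-length sequences")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


# Made-up scores for illustration only (not data from the paper).
true_stoi = [0.42, 0.65, 0.81, 0.93]
predicted_stoi = [0.45, 0.60, 0.85, 0.90]
print(pearson_correlation(predicted_stoi, true_stoi))
print(mean_squared_error(predicted_stoi, true_stoi))
```

A better predictor pushes the correlation toward 1 and the mean squared error toward 0 when both are computed against STOI scores obtained with the clean reference.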

Abstract

In this study, we present a novel approach to predicting the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture. Traditional methods for calculating STOI typically require clean reference speech, which limits their applicability in the real world. To address this limitation, numerous deep learning-based nonintrusive speech assessment models have attracted significant interest. Many studies have achieved commendable performance, but there is room for further improvement. We propose the use of a bottleneck transformer that incorporates convolution blocks for learning frame-level features and a multi-head self-attention (MHSA) layer to aggregate the information. These components enable the transformer to focus on the key aspects of the input data. Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios…
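The general pattern the abstract describes, convolution over frames followed by multi-head self-attention that aggregates information across the utterance, can be sketched roughly as below. This is not the authors' architecture: the layer sizes, the residual connection, the mean pooling, the sigmoid output (chosen here because STOI scores typically lie between 0 and 1), and every function and parameter name are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def conv1d_frames(x, kernel):
    """1-D convolution over time with zero 'same' padding and ReLU.

    x: (T, C_in) frame-level input; kernel: (K, C_in, C_out), K assumed odd.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([np.einsum("kc,kco->o", xp[t:t + k], kernel)
                    for t in range(x.shape[0])])
    return np.maximum(out, 0.0)


def multi_head_self_attention(x, w_q, w_k, w_v, num_heads):
    """Scaled dot-product self-attention; x: (T, D), D divisible by num_heads."""
    t, d = x.shape
    d_head = d // num_heads

    def split(m):
        # (T, D) -> (heads, T, d_head)
        return m.reshape(t, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head), axis=-1)
    context = weights @ v                      # (heads, T, d_head)
    return context.transpose(1, 0, 2).reshape(t, d)


def init_params(rng, c_in, d_model, kernel_size):
    """Random, untrained weights (hypothetical shapes)."""
    scale = 0.1
    return {
        "conv": scale * rng.standard_normal((kernel_size, c_in, d_model)),
        "w_q": scale * rng.standard_normal((d_model, d_model)),
        "w_k": scale * rng.standard_normal((d_model, d_model)),
        "w_v": scale * rng.standard_normal((d_model, d_model)),
        "w_out": scale * rng.standard_normal(d_model),
        "b_out": 0.0,
    }


def predict_score(features, params, num_heads=4):
    """Map (T, C_in) frame features to one utterance-level score in [0, 1]."""
    h = conv1d_frames(features, params["conv"])
    # Residual connection around attention (an assumption of this sketch).
    h = h + multi_head_self_attention(
        h, params["w_q"], params["w_k"], params["w_v"], num_heads)
    pooled = h.mean(axis=0)                    # aggregate over frames
    logit = pooled @ params["w_out"] + params["b_out"]
    return float(1.0 / (1.0 + np.exp(-logit)))


rng = np.random.default_rng(0)
params = init_params(rng, c_in=8, d_model=16, kernel_size=3)
frames = rng.standard_normal((50, 8))          # stand-in for spectral frames
print(predict_score(frames, params))
```

Only the degraded signal's features enter `predict_score`, which is what makes such a predictor nonintrusive: no clean reference is needed at inference time.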

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders