# DRASP: A Dual-Resolution Attentive Statistics Pooling Framework for Automatic MOS Prediction

**Authors:** Cheng-Yeh Yang, Kuan-Tang Huang, Chien-Chun Wang, Hung-Shin Lee, Hsin-Min Wang, Berlin Chen

arXiv: 2508.21407 · 2025-09-01

## TL;DR

The paper introduces DRASP, a dual-resolution pooling framework that combines global statistics and attentive local analysis to improve speech quality prediction accuracy across diverse datasets and models.

## Contribution

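The paper proposes DRASP, a pooling method that combines coarse-grained global statistical summaries with fine-grained attentive analysis of perceptually significant segments, producing a fixed-size representation of variable-length audio features for MOS prediction.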

## Key findings

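- Consistently outperforms baseline pooling methods on the MusicEval and AES-Natural datasets.
- Achieves a 10.39% relative improvement in system-level SRCC over the widely used average pooling approach.
- Generalizes across MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics) and across different audio generation systems.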
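
A relative improvement such as the SRCC figure above is conventionally the gain over the baseline divided by the baseline score. The sketch below shows that arithmetic; the function name and the example values are hypothetical and are not taken from the paper, and the paper's exact computation is assumed rather than confirmed by this summary.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (improved - baseline) / baseline * 100.0


# Hypothetical SRCC values, chosen only to illustrate the formula.
gain = relative_improvement(baseline=0.50, improved=0.55)
print(f"{gain:.2f}% relative improvement")
```

Under this definition, a relative improvement is not the same as an absolute difference in correlation: the same absolute gain yields a larger relative figure when the baseline is lower.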

## Abstract

A pooling mechanism is essential for mean opinion score (MOS) prediction, facilitating the transformation of variable-length audio features into a concise fixed-size representation that effectively encodes speech quality. Existing pooling methods typically operate at a singular granularity, concentrating either on a comprehensive global perspective or a detailed frame-level analysis, which may overlook complementary perceptual insights. To address this limitation, we introduce the Dual-Resolution Attentive Statistics Pooling (DRASP) framework. DRASP integrates both coarse-grained, global statistical summaries and fine-grained, attentive analyses of perceptually significant segments. This dual-view architecture empowers our model to formulate a more thorough and robust representation, capturing both the overarching structural context and salient local details concurrently. Extensive experiments validate the effectiveness and strong generalization ability of the proposed framework. It consistently outperforms various baseline methods across diverse datasets (MusicEval and AES-Natural), MOS prediction backbones (including a CLAP-based model and AudioBox-Aesthetics), and different audio generation systems, achieving a relative improvement of 10.39% in system-level Spearman's rank correlation coefficient (SRCC) over the widely-used average pooling approach.
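
The dual-view idea in the abstract can be sketched in a few lines of NumPy. This is an illustrative approximation, not the authors' implementation: the summary does not give the exact formulation, so the segmentation scheme, the single attention vector `attn_vector`, the `num_segments` parameter, and all function names are assumptions. The coarse branch takes the mean and standard deviation over all frames; the fine branch scores segment-level summaries with a softmax and takes attention-weighted statistics.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a 1-D array."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def global_stats(frames):
    """Coarse view: mean and standard deviation over all frames -> (2*D,)."""
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


def attentive_segment_stats(frames, attn_vector, num_segments=4):
    """Fine view: attention-weighted mean/std over segment means -> (2*D,).

    `attn_vector` stands in for a learned scoring parameter (assumption).
    """
    if frames.shape[0] < num_segments:
        raise ValueError("need at least one frame per segment")
    segments = np.array_split(frames, num_segments, axis=0)
    seg_means = np.stack([seg.mean(axis=0) for seg in segments])  # (S, D)
    weights = softmax(seg_means @ attn_vector)                    # (S,)
    w_mean = weights @ seg_means                                  # (D,)
    w_var = weights @ (seg_means - w_mean) ** 2                   # (D,)
    return np.concatenate([w_mean, np.sqrt(np.maximum(w_var, 0.0))])


def dual_resolution_pool(frames, attn_vector, num_segments=4):
    """Concatenate the coarse and fine views into one fixed-size vector (4*D,)."""
    return np.concatenate([
        global_stats(frames),
        attentive_segment_stats(frames, attn_vector, num_segments),
    ])


# Two clips of different lengths map to vectors of the same size.
rng = np.random.default_rng(0)
attn = rng.standard_normal(8)
short_clip = rng.standard_normal((50, 8))   # 50 frames, 8-dim features
long_clip = rng.standard_normal((73, 8))    # 73 frames, 8-dim features
print(dual_resolution_pool(short_clip, attn).shape)
print(dual_resolution_pool(long_clip, attn).shape)
```

In a trained model the attention parameters would be learned jointly with the MOS predictor; here they are random, so the sketch demonstrates only the shapes and data flow.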

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21407/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21407/full.md

## References

29 references — full list in the complete paper: https://tomesphere.com/paper/2508.21407/full.md

---
Source: https://tomesphere.com/paper/2508.21407