Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

Andros Tjandra; Yi-Chiao Wu; Baishan Guo; John Hoffman; Brian Ellis; Apoorv Vyas; Bowen Shi; Sanyuan Chen; Matt Le; Nick Zacharov; Carleigh Wood; Ann Lee; Wei-Ning Hsu

arXiv:2502.05139·cs.SD·February 10, 2025

TL;DR

This paper introduces a unified, automated approach for assessing audio aesthetics across speech, music, and sound, using new annotation guidelines and no-reference models that outperform existing methods and are openly available.

Contribution

It presents a novel annotation framework and no-reference prediction models for audio aesthetics, enabling consistent, automated quality assessment across diverse audio types.

Findings

01

Evaluated against human mean opinion scores (MOS) and existing methods, the models achieve comparable or superior performance.

02

The approach is applicable to speech, music, and sound, demonstrating versatility.

03

Open-source code and datasets support future research and benchmarking.

Abstract

The quantification of audio aesthetics remains a complex challenge in audio processing, primarily due to its subjective nature, which is influenced by human perception and cultural context. Traditional methods often depend on human listeners for evaluation, leading to inconsistencies and high resource demands. This paper addresses the growing need for automated systems capable of predicting audio aesthetics without human intervention. Such systems are crucial for applications like data filtering, pseudo-labeling large datasets, and evaluating generative audio models, especially as these models become more sophisticated. In this work, we introduce a novel approach to audio aesthetic evaluation by proposing new annotation guidelines that decompose human listening perspectives into four distinct axes. We develop and train no-reference, per-item prediction models that offer a more nuanced…
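The data-filtering use case mentioned in the abstract can be illustrated with a minimal sketch. It assumes each clip has already been scored by some no-reference predictor on four axes. The axis names (`axis_1` to `axis_4`), the `ScoredClip` type, the `filter_clips` helper, and all score values are hypothetical placeholders invented for illustration; none of them come from the paper or from the released repository.

```python
from dataclasses import dataclass

# Hypothetical axis names: the abstract only states that there are four axes.
AXES = ("axis_1", "axis_2", "axis_3", "axis_4")


@dataclass
class ScoredClip:
    path: str
    scores: dict  # maps axis name -> predicted score for this clip


def filter_clips(clips, thresholds):
    """Keep clips whose predicted score meets the minimum on every listed axis."""
    kept = []
    for clip in clips:
        if all(clip.scores[axis] >= minimum for axis, minimum in thresholds.items()):
            kept.append(clip)
    return kept


# Made-up scores, used only to show the filtering logic.
clips = [
    ScoredClip("a.wav", {"axis_1": 7.5, "axis_2": 4.0, "axis_3": 6.1, "axis_4": 6.8}),
    ScoredClip("b.wav", {"axis_1": 3.2, "axis_2": 5.5, "axis_3": 2.9, "axis_4": 4.0}),
]

kept = filter_clips(clips, {"axis_1": 6.0, "axis_3": 5.0})
print([clip.path for clip in kept])  # prints ['a.wav']
```

Thresholding each axis separately, rather than averaging the four scores, lets a filter be strict on one axis while ignoring another; whether that matches how the authors filter data is not stated in the text above.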

Code & Models

Repositories

facebookresearch/audiobox-aesthetics
PyTorch · Official

Taxonomy

Topics: Music Technology and Sound Studies · Music and Audio Processing · Multisensory perception and integration