AudioBERTScore: Objective Evaluation of Environmental Sound Synthesis Based on Similarity of Audio Embedding Sequences

Minoru Kishi; Ryosuke Sakai; Shinnosuke Takamichi; Yusuke Kanamori; Yuki Okamoto

arXiv:2507.00475 · cs.SD · July 2, 2025

TL;DR

This paper introduces AudioBERTScore, an objective metric that evaluates environmental sound synthesis by measuring the similarity between embeddings of the synthesized and reference sounds. Its scores correlate more strongly with human subjective assessments than those of conventional metrics.

Contribution

The paper presents AudioBERTScore, a novel objective evaluation metric that combines embedding similarity with $p$-norms to reflect environmental sound quality more faithfully than existing metrics.

Findings

01

AudioBERTScore shows higher correlation with subjective evaluations.

02

The $p$-norm formulation is designed to reflect the non-local nature of environmental sounds.

03

In experiments, AudioBERTScore outperforms conventional objective metrics.

Abstract

We propose a novel objective evaluation metric for synthesized audio in text-to-audio (TTA), aiming to improve the performance of TTA models. In TTA, subjective evaluation of the synthesized sound is important, but carrying it out incurs monetary costs. Objective metrics such as mel-cepstral distortion are therefore used instead, but their correlation with subjective evaluation values is weak. Our proposed objective evaluation metric, AudioBERTScore, calculates the similarity between the embeddings of the synthesized and reference sounds. The method is based not only on the max-norm used in conventional BERTScore but also on the $p$-norm, to reflect the non-local nature of environmental sounds. Experimental results show that scores obtained by the proposed method have a higher correlation with subjective evaluation values than conventional metrics.
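The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the clipping of negative similarities, and the use of a normalized $p$-norm (power mean) are all assumptions made for illustration. The sketch builds a cosine-similarity matrix between two embedding sequences, pools it with either the max (as in BERTScore) or a $p$-norm, and averages the pooled values into precision, recall, and F1.

```python
import numpy as np


def cosine_similarity_matrix(ref, syn, eps=1e-12):
    """Pairwise cosine similarity between frames of two embedding sequences.

    ref: array of shape (T_ref, D), syn: array of shape (T_syn, D).
    Returns an array of shape (T_ref, T_syn).
    """
    ref_n = ref / (np.linalg.norm(ref, axis=1, keepdims=True) + eps)
    syn_n = syn / (np.linalg.norm(syn, axis=1, keepdims=True) + eps)
    return ref_n @ syn_n.T


def pool(sim, axis, p=None):
    """Pool similarities along one axis.

    p=None uses the max, as in BERTScore. A finite p uses a normalized
    p-norm (power mean) over non-negative similarities. Clipping and
    normalization are assumptions of this sketch, not taken from the paper.
    """
    if p is None:
        return sim.max(axis=axis)
    clipped = np.clip(sim, 0.0, None)
    return np.mean(clipped ** p, axis=axis) ** (1.0 / p)


def embedding_score(ref, syn, p=None):
    """Return (precision, recall, F1) for a reference/synthesized pair."""
    sim = cosine_similarity_matrix(ref, syn)
    # Recall: how well each reference frame is matched by the synthesized frames.
    recall = float(pool(sim, axis=1, p=p).mean())
    # Precision: how well each synthesized frame is matched by the reference frames.
    precision = float(pool(sim, axis=0, p=p).mean())
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Random placeholders standing in for embeddings from a pretrained audio model.
    reference = rng.normal(size=(50, 16))
    synthesized = rng.normal(size=(40, 16))
    print("max pooling:", embedding_score(reference, synthesized))
    print("p = 4      :", embedding_score(reference, synthesized, p=4.0))
```

In practice the two inputs would be frame-level embeddings extracted from a pretrained audio model. With max pooling, each frame is scored only by its single best match. A finite `p` lets several similar frames contribute, which is one plausible way to account for sounds whose content is spread over time.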

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies