Human-CLAP: Human-perception-based contrastive language-audio pretraining

Taisei Takano; Yuki Okamoto; Yusuke Kanamori; Yuki Saito; Ryotaro Nagase; Hiroshi Saruwatari

arXiv:2506.23553 · eess.AS · March 11, 2026

Open Access

TL;DR

This paper introduces Human-CLAP, a contrastive language-audio model trained on human subjective scores, significantly improving the correlation between model scores and human perception in audio-text relevance evaluation.

Contribution

The paper proposes Human-CLAP, a contrastive language-audio model trained with human subjective evaluation scores so that its CLAPScore aligns more closely with human perception of audio-text relevance.

Findings

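01. Human-CLAP raises the SRCC between CLAPScore and human subjective evaluation scores by more than 0.25 compared with conventional CLAP.

02. CLAPScore computed with conventional CLAP has a low correlation with human subjective evaluation scores.

03. Human-CLAP improves the reliability of audio-text relevance assessment.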

Abstract

Contrastive language-audio pretraining (CLAP) is widely used for audio generation and recognition tasks. For example, CLAPScore, which utilizes the similarity of CLAP embeddings, has been a major metric for evaluating the relevance between audio and text in text-to-audio. However, the relationship between CLAPScore and human subjective evaluation scores remains unclear. We show that CLAPScore has a low correlation with human subjective evaluation scores. Additionally, we propose a human-perception-based CLAP called Human-CLAP by training a contrastive language-audio model using subjective evaluation scores. Our experimental results indicate that Human-CLAP improved the Spearman's rank correlation coefficient (SRCC) between the CLAPScore and the subjective evaluation scores by more than 0.25 compared with the conventional CLAP.
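To make the two quantities in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes CLAPScore is the plain cosine similarity between an audio embedding and a text embedding, and it computes SRCC as the Pearson correlation of average ranks. The function names, the two-dimensional embeddings, and the human scores are all hypothetical toy values chosen for illustration; real CLAP embeddings come from a trained model and have much higher dimensionality.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def srcc(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy embeddings (assumption: one audio/text pair per index).
audio_embeddings = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
text_embeddings = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

# Hypothetical human relevance ratings for the same three pairs.
human_scores = [9.0, 6.0, 2.0]

# CLAPScore per pair, under the cosine-similarity assumption stated above.
clap_scores = [
    cosine_similarity(a, t) for a, t in zip(audio_embeddings, text_embeddings)
]

rho = srcc(clap_scores, human_scores)
print("CLAPScore per pair:", clap_scores)
print("SRCC with human scores:", rho)
```

In this toy data the similarity scores and the human ratings put the three pairs in the same order, so the SRCC equals 1 up to floating-point rounding. The paper's finding is that with conventional CLAP this correlation is low on real data, and that training on subjective scores raises it.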

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing