GLAP: General contrastive audio-text pretraining across domains and languages

Heinrich Dinkel; Zhiyong Yan; Tianzi Wang; Yongqing Wang; Xingwei Sun; Yadong Niu; Jizhong Liu; Gang Li; Junbo Zhang; Jian Luan

arXiv:2506.11350·cs.SD·January 21, 2026

Open Access · 1 Repo · 1 Model

TL;DR

GLAP is a multilingual, multi-domain audio-text pretraining model that stays competitive on standard sound and music retrieval benchmarks while significantly surpassing existing methods on speech retrieval and classification.

Contribution

Introduces GLAP, a novel multilingual and multi-domain contrastive pretraining framework for audio and text, extending CLAP capabilities.

Findings

01 Achieves competitive results on Clotho and AudioCaps benchmarks.

02 Surpasses existing methods in speech retrieval and classification.

03 Excels in zero-shot sound-event recognition and multilingual keyword spotting.

Abstract

Contrastive Language Audio Pretraining (CLAP) is a widely used method to bridge the gap between audio and text domains. Current CLAP methods enable sound and music retrieval in English, ignoring multilingual spoken content. To address this, we introduce general language audio pretraining (GLAP), which expands CLAP with multilingual and multi-domain abilities. GLAP demonstrates its versatility by achieving competitive performance on standard audio-text retrieval benchmarks like Clotho and AudioCaps, while significantly surpassing existing methods in speech retrieval and classification tasks. Additionally, GLAP achieves strong results on widely used sound-event zero-shot benchmarks, while simultaneously outperforming previous methods on speech content benchmarks. Further keyword spotting evaluations across 50 languages emphasize GLAP's advanced multilingual capabilities. Finally,…
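The zero-shot recognition and retrieval results mentioned in the abstract rest on a shared embedding space: audio and text are each mapped to a vector, and the best match is the pair with the highest similarity. The following is a minimal sketch of that scoring step only, not GLAP's actual code. The three-dimensional embeddings, the label prompts, and the helper names (`l2_normalize`, `cosine_similarity`, `zero_shot_classify`) are all hypothetical and hand-written for illustration; a real model would produce the vectors with its audio and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def zero_shot_classify(audio_embedding, label_embeddings):
    """Return the label whose text embedding is closest to the audio embedding.

    label_embeddings maps a text prompt to its (hypothetical) text embedding.
    """
    scores = {
        label: cosine_similarity(audio_embedding, emb)
        for label, emb in label_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings, made up for this example (not model outputs).
label_embeddings = {
    "dog barking": [0.9, 0.1, 0.0],
    "piano music": [0.0, 1.0, 0.2],
    "a person saying hello": [0.1, 0.0, 1.0],
}
audio_embedding = [0.8, 0.2, 0.1]

best_label, scores = zero_shot_classify(audio_embedding, label_embeddings)
print(best_label)
```

Text-to-audio retrieval uses the same scoring in the other direction: embed one text query, score it against many audio embeddings, and rank the audio clips by similarity.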

Code & Models

Repositories

xiaomi-research/dasheng-glap
jax · Official

Models

🤗 mispeech/GLAP
model · 43 dl · ♡ 5

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Phonetics and Phonology Research