WavLink: Compact Audio-Text Embeddings with a Global Whisper Token

Gokul Karthik Kumar; Ludovick Lepauloux; Hakim Hacid

arXiv:2601.15118·cs.SD·January 23, 2026

TL;DR

WavLink is a compact audio-text embedding model that augments the Whisper encoder with a learnable global token, achieving state-of-the-art retrieval and classification performance while allowing embeddings up to 8x smaller.

Contribution

The paper presents WavLink, a novel approach that combines Whisper with a learnable global token and a two-stage training process for efficient, high-performance audio-text embeddings.

Findings

01. Achieves state-of-the-art retrieval performance.
02. Enables 8x smaller embeddings with minimal performance loss.
03. Demonstrates competitive results on AIR-Bench tasks.
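
To make the "8x smaller embeddings" finding concrete, here is a minimal sketch of how Matryoshka-style embeddings are typically shrunk at inference time: keep a leading slice of the vector, re-normalize it, and compare with cosine similarity. The prefix-truncation convention, the 16-dimensional size, the helper names, and all numeric values are assumptions for illustration; none of them come from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate_embedding(vec, factor=8):
    """Keep the leading len(vec) // factor dimensions and re-normalize.

    Prefix truncation is the usual Matryoshka convention; it is assumed here.
    """
    keep = max(1, len(vec) // factor)
    return l2_normalize(vec[:keep])


def cosine(a, b):
    """For unit vectors, cosine similarity is just the dot product."""
    return sum(x * y for x, y in zip(a, b))


def rank_texts(audio_vec, text_vecs, factor=1):
    """Return text indices sorted from most to least similar to the audio."""
    a = truncate_embedding(audio_vec, factor)
    scores = [cosine(a, truncate_embedding(t, factor)) for t in text_vecs]
    return sorted(range(len(text_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical 16-dimensional embeddings (made-up values, not model outputs).
audio = [0.9, 0.1, 0.3, -0.2] + [0.05] * 12
texts = [
    [0.8, 0.2, 0.25, -0.1] + [0.0] * 12,  # caption close to the audio clip
    [-0.7, 0.6, 0.1, 0.4] + [0.1] * 12,   # unrelated caption
]

full_ranking = rank_texts(audio, texts, factor=1)   # all 16 dimensions
small_ranking = rank_texts(audio, texts, factor=8)  # first 2 dimensions only
print(full_ranking, small_ranking)
```

In this toy case the ranking is unchanged after truncation. Whether that holds for real data depends on training the leading dimensions to carry most of the signal, which is the role Matryoshka-style supervision is meant to play.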

Abstract

Whisper has become the de facto encoder for extracting general-purpose audio features in large audio-language models, where a 30-second clip is typically represented by 1500 frame features projected into an LLM. In contrast, audio-text embedding models such as CLAP-based models have largely relied on alternative audio encoders (e.g., HTS-AT, PaSST) and have not leveraged Whisper effectively. We present WavLink, a compact audio-text embedding model that augments the Whisper encoder with a learnable global token, trained jointly with a text encoder. Through a systematic study of design choices, including pretrained text encoders, loss functions, training modes, and data mixtures, we identify configurations that yield state-of-the-art retrieval performance. Our two-stage training recipe across three model sizes, combined with Matryoshka-style supervision, improves scalability, enabling 8x…
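
The contrast between 1500 frame features and a single global token can be illustrated with a toy sketch. It prepends a token to the frame sequence and runs one attention step in which that token attends to the whole sequence, yielding one clip-level vector. This is not the paper's architecture: in WavLink the token would pass through the Whisper encoder's own layers and be learned during training, whereas here the token is random, there are no learned projections, and the feature width `DIM` and the function `global_token_pool` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_token_pool(frames, global_token):
    """One toy attention step: the global token queries itself plus all frames.

    Returns a single vector with the same width as one frame feature.
    """
    dim = len(global_token)
    sequence = [global_token] + frames  # token is prepended to the frames
    scale = math.sqrt(dim)
    scores = [
        sum(q * k for q, k in zip(global_token, tok)) / scale
        for tok in sequence
    ]
    weights = softmax(scores)
    return [
        sum(w * tok[d] for w, tok in zip(weights, sequence))
        for d in range(dim)
    ]


rng = random.Random(0)
NUM_FRAMES = 1500  # frame features per 30-second clip, as stated in the abstract
DIM = 8            # hypothetical feature width, kept tiny for the example

# Random stand-ins for encoder frame features and for the learnable token.
frames = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_FRAMES)]
global_token = [rng.uniform(-1, 1) for _ in range(DIM)]

clip_embedding = global_token_pool(frames, global_token)

values_as_frames = NUM_FRAMES * DIM          # storing every frame feature
values_as_global = len(clip_embedding)       # storing one pooled vector
print(values_as_frames, values_as_global)
```

The point of the sketch is only the shape of the output: a retrieval index stores one vector per clip instead of 1500, independent of any further Matryoshka truncation.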

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Topic Modeling