CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval

Mohammad Mahdi Abootorabi; Ehsaneddin Asgari

arXiv:2412.13071 · cs.CL · March 25, 2025

TL;DR

CLASP is a multilingual, multimodal pretraining framework that improves audio-text retrieval by combining a pre-trained speech model with a multilingual sentence encoder, and it outperforms traditional ASR-based retrieval across diverse languages and content categories.

Contribution

The paper introduces CLASP, a unified contrastive pretraining approach for multilingual speech and text, utilizing a new speech-text dataset and outperforming existing retrieval methods.

Findings

01 Achieves state-of-the-art HITS@1, MRR, and meanR metrics across multiple languages.

02 Outperforms traditional ASR-based retrieval methods in multilingual scenarios.

03 Demonstrates effectiveness in diverse categories from fiction to religion.
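
The three metrics named in the findings can be illustrated with a minimal sketch. It assumes each query has exactly one correct candidate and that meanR denotes the mean rank of that candidate (lower is better). The function names and the toy score matrix are hypothetical and are not taken from the paper or its repository.

```python
from typing import Dict, List, Sequence


def rank_of_match(scores: Sequence[float], target: int) -> int:
    """Return the 1-based rank of the correct candidate under descending score order."""
    target_score = scores[target]
    # Count candidates that score strictly higher; ties are resolved in favour of the target.
    return 1 + sum(1 for s in scores if s > target_score)


def retrieval_metrics(score_matrix: List[Sequence[float]], targets: Sequence[int]) -> Dict[str, float]:
    """Compute HITS@1, MRR, and mean rank over a batch of queries."""
    ranks = [rank_of_match(row, t) for row, t in zip(score_matrix, targets)]
    n = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / n,  # share of queries with the match ranked first
        "mrr": sum(1.0 / r for r in ranks) / n,          # mean reciprocal rank
        "mean_rank": sum(ranks) / n,                     # average rank of the match
    }


# Toy example (hypothetical scores): rows are queries, columns are candidates.
scores = [
    [0.9, 0.1, 0.3],  # correct candidate 0 is ranked 1st
    [0.2, 0.4, 0.8],  # correct candidate 1 is ranked 2nd
    [0.5, 0.6, 0.7],  # correct candidate 2 is ranked 1st
]
targets = [0, 1, 2]
metrics = retrieval_metrics(scores, targets)
print(metrics)
```

Higher HITS@1 and MRR indicate better retrieval, while a lower mean rank is better, so the three numbers should be read together rather than in isolation.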

Abstract

This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming…
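
The abstract does not state the exact training objective. As a rough illustration of what contrastive speech-text pretraining typically involves, the sketch below implements a symmetric, CLIP-style InfoNCE loss over paired embeddings. This choice of loss, the temperature value of 0.07, and all function names are assumptions made for illustration and may differ from what CLASP actually uses.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def log_softmax(row: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(
    speech: List[Sequence[float]],
    text: List[Sequence[float]],
    temperature: float = 0.07,  # assumed value, not from the paper
) -> float:
    """Average of speech-to-text and text-to-speech cross-entropy; pair i matches pair i."""
    n = len(speech)
    logits = [[cosine(s, t) / temperature for t in text] for s in speech]
    # Speech-to-text direction: each speech clip should select its own transcript.
    s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: the same computation over the transposed logits.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (s2t + t2s)


# Toy embeddings (hypothetical): aligned pairs should give a much lower loss than swapped pairs.
speech_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(speech_emb, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(speech_emb, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

In a real system the embeddings would come from the speech and sentence encoders described in the abstract; here they are hand-written vectors so the behaviour of the loss is easy to inspect.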

Code & Models

Repositories

llm-lab-org/CLASP
Official

Models

🤗 llm-lab/CLASP
model · ♡ 3

Datasets

llm-lab/SpeechBrown
dataset · 89 dl

Taxonomy

Topics: Natural Language Processing Techniques · Speech and dialogue systems