CLASP: Contrastive Language-Speech Pretraining for Multilingual Multimodal Information Retrieval
Mohammad Mahdi Abootorabi, Ehsaneddin Asgari

TL;DR
CLASP is a multilingual, multimodal contrastive pretraining framework for audio-text retrieval. It pairs a pre-trained speech model with a multilingual sentence encoder and outperforms traditional ASR-based retrieval across languages and content categories.
Contribution
The paper introduces CLASP, a unified contrastive pretraining approach for multilingual speech and text. It is trained on a newly introduced speech-text dataset spanning 15 categories and outperforms ASR-based retrieval methods.
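To make "contrastive pretraining" concrete, here is a minimal sketch of a symmetric contrastive objective over paired speech and text embeddings. The summary does not state the exact loss CLASP uses, so the CLIP-style symmetric cross-entropy, the function names, and the temperature value of 0.07 are all assumptions for illustration, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Hypothetical CLIP-style loss: row i of each input is a matching pair.

    temperature=0.07 is an assumed value, not taken from the paper.
    """
    speech = [l2_normalize(v) for v in speech_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(speech)

    # logits[i][j] = scaled cosine similarity between speech i and text j.
    logits = [
        [sum(a * b for a, b in zip(speech[i], text[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy_on_diagonal(rows):
        # Mean of -log softmax(row)[i], computed with a stable log-sum-exp.
        total = 0.0
        for i, row in enumerate(rows):
            top = max(row)
            log_z = top + math.log(sum(math.exp(x - top) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    text_to_speech = [list(col) for col in zip(*logits)]
    # Average the speech-to-text and text-to-speech directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(text_to_speech))
```

Under this sketch, a batch whose speech and text embeddings are correctly aligned yields a loss near zero, while a batch with shuffled pairs yields a large loss. That gap is the signal that pulls matching speech and text together in the shared space.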
Findings
Achieves state-of-the-art HITS@1, MRR, and meanR metrics across multiple languages.
Outperforms traditional ASR-based retrieval methods in multilingual scenarios.
Demonstrates effectiveness in diverse categories from fiction to religion.
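The three metrics named above can be computed from a query-by-candidate similarity matrix. The sketch below uses their standard definitions and assumes that query i's correct match is candidate i. The function name and the matrix layout are illustrative assumptions, not the paper's evaluation code.

```python
def retrieval_metrics(similarity):
    """Compute HITS@1, MRR, and meanR from a square similarity matrix.

    similarity[i][j] is the score of candidate j for query i; the correct
    candidate for query i is assumed to sit at index i.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Rank of the true match: 1 + number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > row[i]))

    n = len(ranks)
    return {
        "hits@1": sum(1 for r in ranks if r == 1) / n,  # fraction ranked first
        "mrr": sum(1.0 / r for r in ranks) / n,         # mean reciprocal rank
        "meanR": sum(ranks) / n,                        # mean rank, lower is better
    }


# Toy example: the first two queries rank their match first, the third ranks it last.
scores = [
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.5],
    [0.7, 0.6, 0.4],
]
metrics = retrieval_metrics(scores)
```

Higher HITS@1 and MRR are better, while a lower meanR is better, so the three are reported together to capture both top-rank accuracy and how far down the list misses fall.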
Abstract
This study introduces CLASP (Contrastive Language-Speech Pretraining), a multilingual, multimodal representation tailored for audio-text information retrieval. CLASP leverages the synergy between spoken content and textual data. During training, we utilize our newly introduced speech-text dataset, which encompasses 15 diverse categories ranging from fiction to religion. CLASP's audio component integrates audio spectrograms with a pre-trained self-supervised speech model, while its language encoding counterpart employs a sentence encoder pre-trained on over 100 languages. This unified lightweight model bridges the gap between various modalities and languages, enhancing its effectiveness in handling and retrieving multilingual and multimodal data. Our evaluations across multiple languages demonstrate that CLASP establishes new benchmarks in HITS@1, MRR, and meanR metrics, outperforming…
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems
