ProLAP: Probabilistic Language-Audio Pre-Training

Toranosuke Manabe; Yuchi Ishikawa; Hokuto Munakata; Tatsuya Komatsu

arXiv:2510.18423·eess.AS·October 22, 2025

TL;DR

ProLAP introduces a probabilistic approach to language-audio pre-training that models the inherent many-to-many relationships using probability distributions, improving semantic understanding and retrieval performance.

Contribution

ProLAP presents a probabilistic framework with two additional training objectives, a hierarchical inclusion loss and a mask repulsive loss, which together allow semantic hierarchies to be learned from small datasets.

Findings

01 Outperforms deterministic models on audio-text retrieval

02 Captures semantic hierarchies effectively

03 Works well with small datasets

Abstract

Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is inherently many-to-many: one audio segment can be described by multiple captions and vice versa. To address this, we propose Probabilistic Language-Audio Pre-training (ProLAP), which models multiplicity as the spread of probability distributions in a joint language-audio embedding space. To train the intra-modal hierarchical relationship effectively, we also introduce two objectives: (i) hierarchical inclusion loss to promote semantic hierarchical understanding of inputs and (ii) mask repulsive loss to improve the efficiency of learning when optimizing the hierarchical inclusion loss. With this training strategy, our model can learn the hierarchical structure…
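
As a rough illustration of "multiplicity as the spread of probability distributions", the sketch below represents each embedding as a diagonal Gaussian (a mean vector plus per-dimension variances) and compares two embeddings by their expected squared distance. The Gaussian form, the distance, the function name `expected_sq_distance`, and the toy numbers are all assumptions made for this example; the abstract does not state which distribution family or similarity ProLAP uses.

```python
def expected_sq_distance(mu_a, var_a, mu_b, var_b):
    """Expected squared Euclidean distance between samples of two
    independent diagonal Gaussians N(mu_a, var_a) and N(mu_b, var_b).

    Closed form: ||mu_a - mu_b||^2 + sum(var_a) + sum(var_b).
    """
    mean_term = sum((ma - mb) ** 2 for ma, mb in zip(mu_a, mu_b))
    spread_term = sum(var_a) + sum(var_b)
    return mean_term + spread_term


# Hypothetical toy embeddings (not taken from the paper).
# An audio clip and a specific caption: small variance, little ambiguity.
audio_mu, audio_var = [0.0, 1.0], [0.01, 0.01]
specific_mu, specific_var = [0.0, 1.0], [0.01, 0.01]
# A generic caption that could match many clips: same mean, wider spread.
generic_mu, generic_var = [0.0, 1.0], [0.5, 0.5]

d_specific = expected_sq_distance(audio_mu, audio_var, specific_mu, specific_var)
d_generic = expected_sq_distance(audio_mu, audio_var, generic_mu, generic_var)

print(f"audio vs specific caption: {d_specific:.2f}")
print(f"audio vs generic caption:  {d_generic:.2f}")
```

With identical means, the only difference between the two distances comes from the variance term, which is how a distribution-valued embedding can encode that a generic caption matches many clips less sharply than a specific one. A deterministic embedding has no equivalent term.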

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis