ProLAP: Probabilistic Language-Audio Pre-Training
Toranosuke Manabe, Yuchi Ishikawa, Hokuto Munakata, Tatsuya Komatsu

TL;DR
ProLAP introduces a probabilistic approach to language-audio pre-training that models the inherent many-to-many relationships using probability distributions, improving semantic understanding and retrieval performance.
Contribution
ProLAP presents a probabilistic framework with two auxiliary objectives, a hierarchical inclusion loss and a mask repulsive loss, that enable effective learning of semantic hierarchies from small datasets.
Findings
Outperforms deterministic models on audio-text retrieval
Captures semantic hierarchies effectively
Learns these hierarchies effectively even from small datasets
Abstract
Language-audio joint representation learning frameworks typically depend on deterministic embeddings, assuming a one-to-one correspondence between audio and text. In real-world settings, however, the language-audio relationship is inherently many-to-many: one audio segment can be described by multiple captions and vice versa. To address this, we propose Probabilistic Language-Audio Pre-training (ProLAP), which models multiplicity as the spread of probability distributions in a joint language-audio embedding space. To train the intra-modal hierarchical relationship effectively, we also introduce two objectives: (i) hierarchical inclusion loss to promote semantic hierarchical understanding of inputs and (ii) mask repulsive loss to improve the efficiency of learning when optimizing the hierarchical inclusion loss. With this training strategy, our model can learn the hierarchical structure…
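To make the idea of "multiplicity as the spread of probability distributions" concrete, here is a minimal sketch in plain Python. It assumes each input is embedded as a diagonal Gaussian (a mean vector plus per-dimension variances); the abstract does not state the distribution family, so this choice, along with the names `GaussianEmbedding`, `spread`, and `expected_sq_distance`, is hypothetical and for illustration only. The distance shown is the standard expected squared Euclidean distance between samples of two independent Gaussians, not necessarily the similarity or loss that ProLAP uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GaussianEmbedding:
    """Hypothetical probabilistic embedding: a diagonal Gaussian (assumption)."""
    mean: List[float]
    var: List[float]  # per-dimension variances, each non-negative

    def spread(self) -> float:
        # Total variance. A larger value models a more ambiguous input,
        # i.e. one that matches many items in the other modality.
        return sum(self.var)


def expected_sq_distance(a: GaussianEmbedding, b: GaussianEmbedding) -> float:
    # For independent za ~ a and zb ~ b:
    #   E||za - zb||^2 = ||mean_a - mean_b||^2 + sum(var_a) + sum(var_b)
    mean_term = sum((x - y) ** 2 for x, y in zip(a.mean, b.mean))
    return mean_term + a.spread() + b.spread()


# Toy values, invented for illustration (not from the paper).
audio_clip = GaussianEmbedding(mean=[0.0, 1.0], var=[0.1, 0.1])
specific_caption = GaussianEmbedding(mean=[0.0, 1.0], var=[0.1, 0.1])
generic_caption = GaussianEmbedding(mean=[0.0, 1.0], var=[0.5, 0.5])

d_specific = expected_sq_distance(audio_clip, specific_caption)
d_generic = expected_sq_distance(audio_clip, generic_caption)
```

In this toy setup all three means coincide, so a deterministic embedding could not tell the two captions apart. The variance term does: the generic caption has a wider spread, which yields a larger expected distance to any single audio clip.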
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
