ParaCLAP -- Towards a general language-audio model for computational paralinguistic tasks

Xin Jing; Andreas Triantafyllopoulos; Björn Schuller

arXiv:2406.07203·cs.SD·June 12, 2024

Open Access

TL;DR

ParaCLAP introduces a novel CLAP-style model tailored for computational paralinguistic tasks, leveraging a new audio-language query creation process to improve generalization and outperform existing models.

Contribution

The paper presents ParaCLAP, an approach to training CLAP-style models for computational paralinguistic (CP) tasks, including a novel process for creating the audio-language queries used in pretraining.

Findings

01

ParaCLAP surpasses state-of-the-art models on CP tasks.

02

The new query creation process improves model generalization.

03

ParaCLAP demonstrates effective transferability across diverse paralinguistic tasks.

Abstract

Contrastive language-audio pretraining (CLAP) has recently emerged as a method for making audio analysis more generalisable. Specifically, CLAP-style models are able to 'answer' a diverse set of language queries, extending the capabilities of audio models beyond a closed set of labels. However, CLAP relies on a large set of (audio, query) pairs for pretraining. While such sets are available for general audio tasks, like captioning or sound event detection, there are no datasets with matched audio and text queries for computational paralinguistic (CP) tasks. As a result, the community relies on generic CLAP models trained for general audio with limited success. In the present study, we explore training considerations for ParaCLAP, a CLAP-style model suited to CP, including a novel process for creating audio-language queries. We demonstrate its effectiveness on a set of computational…
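
To make the CLAP idea in the abstract concrete, here is a minimal sketch of the generic technique, not the authors' implementation. It assumes audio clips and text queries have already been mapped to fixed-size embeddings by some encoders (not shown), and it uses the symmetric contrastive objective commonly associated with CLIP/CLAP-style training. The function names, the toy 2-D embeddings, the example queries, and the temperature value of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(audio_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarity between every audio and text embedding."""
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(av, tv)) / temperature for tv in t] for av in a]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Average of the audio-to-text and text-to-audio losses over matched pairs."""
    logits = similarity_matrix(audio_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))


def zero_shot_classify(audio_emb, query_embs, queries):
    """Return the language query whose embedding is most similar to the audio."""
    scores = similarity_matrix([audio_emb], query_embs, temperature=1.0)[0]
    best = max(range(len(queries)), key=lambda i: scores[i])
    return queries[best]


# Toy data (hypothetical): pair i of audio and text is assumed to match.
toy_audio = [[1.0, 0.0], [0.0, 1.0]]
toy_text = [[0.9, 0.1], [0.1, 0.9]]
toy_queries = ["speaker sounds angry", "speaker sounds calm"]

print(symmetric_contrastive_loss(toy_audio, toy_text))
print(zero_shot_classify([1.0, 0.2], toy_text, toy_queries))
```

Training lowers the loss by pulling matched (audio, query) pairs together and pushing mismatched pairs apart. At inference time the label set is just a list of text queries, which is what lets such a model go beyond a closed set of labels.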

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Music and Audio Processing