Multi-View Multi-Task Modeling with Speech Foundation Models for Speech Forensic Tasks

Orchid Chetia Phukan; Devyani Koshal; Swarup Ranjan Behera; Arun; Balaji Buduru; Rajesh Sharma

arXiv:2410.12947 · eess.AS · October 18, 2024

TL;DR

This paper introduces a multi-view multi-task learning framework using speech foundation models for speech forensic tasks, improving performance by integrating diverse representations through a novel TANGO method.

Contribution

It proposes a multi-view learning approach with the TANGO framework to enhance multi-task speech forensic performance across multiple datasets.
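This summary does not describe how TANGO works internally, so the following is only a generic sketch of the multi-view multi-task idea, not the authors' method. It assumes two pre-extracted SFM representations ("views") fused by plain concatenation, one shared hidden layer, and one classification head per task. All names, dimensions, and class counts are hypothetical, and age estimation (a regression task) is left out for brevity.

```python
import math
import random

# Hypothetical tasks and class counts; not taken from the paper.
TASKS = {"speaker": 4, "emotion": 3, "gender": 2}


def init_layer(rng, n_in, n_out):
    """Return (weights, biases) for a dense layer with small random weights."""
    w = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return w, b


def linear(x, w, b):
    """Apply a dense layer: one output per weight row."""
    return [sum(xi * wi for xi, wi in zip(x, row)) + bj for row, bj in zip(w, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def softmax(v):
    m = max(v)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def build_model(rng, view_dims, hidden, tasks):
    """One shared layer over the fused views, plus one head per task."""
    shared = init_layer(rng, sum(view_dims), hidden)
    heads = {name: init_layer(rng, hidden, n) for name, n in tasks.items()}
    return {"shared": shared, "heads": heads}


def forward(model, views):
    """Fuse views by concatenation (an assumption), then run every task head."""
    fused = [x for view in views for x in view]
    h = relu(linear(fused, *model["shared"]))
    return {name: softmax(linear(h, *head)) for name, head in model["heads"].items()}


def joint_loss(probs, labels):
    """Unweighted sum of per-task cross-entropy losses."""
    return sum(-math.log(probs[task][labels[task]]) for task in labels)


rng = random.Random(0)
view_a = [rng.gauss(0, 1) for _ in range(8)]  # stand-in for one SFM's embedding
view_b = [rng.gauss(0, 1) for _ in range(6)]  # stand-in for a second SFM's embedding
model = build_model(rng, [8, 6], 16, TASKS)
probs = forward(model, [view_a, view_b])
loss = joint_loss(probs, {"speaker": 1, "emotion": 0, "gender": 1})
```

Because the shared layer is computed once and reused by every head, a single forward pass serves all tasks; this is the kind of saving that a multi-task setup offers over training one model per task.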

Findings

01 TANGO outperforms individual SFM representations.

02 Multi-view learning improves multi-task performance.

03 The approach reduces resource requirements compared to separate models.

Abstract

Speech forensic tasks (SFTs), such as automatic speaker recognition (ASR), speech emotion recognition (SER), gender recognition (GR), and age estimation (AE), find use in different security and biometric applications. Previous works have applied various techniques, with recent studies focusing on applying speech foundation models (SFMs) for improved performance. However, most prior efforts have centered on building individual models for each task separately, despite the inherent similarities among these tasks. This isolated approach results in higher computational resource requirements, increased costs, time consumption, and maintenance challenges. In this study, we address these challenges by employing a multi-task learning strategy. Firstly, we explore the various state-of-the-art (SOTA) SFMs by extracting their representations for learning these SFTs and investigating their…

Taxonomy

Topics: Speech Recognition and Synthesis