PARROT: Synergizing Mamba and Attention-based SSL Pre-Trained Models via Parallel Branch Hadamard Optimal Transport for Speech Emotion Recognition

Orchid Chetia Phukan; Mohd Mujtaba Akhtar; Girish; Swarup Ranjan Behera; Jaya Sai Kiran Patibandla; Arun Balaji Buduru; Rajesh Sharma

arXiv:2506.01138 · eess.AS · June 3, 2025

TL;DR

This paper introduces PARROT, a framework that fuses Mamba-based and attention-based SSL pre-trained models through parallel branch fusion with optimal transport and the Hadamard product. The authors report state-of-the-art speech emotion recognition results against individual PTMs and homogeneous PTM fusion.

Contribution

The paper presents a heterogeneous PTM fusion method for speech emotion recognition (SER) that combines Mamba-based and attention-based models using optimal transport and the Hadamard product.

Findings

01 Achieves state-of-the-art results in speech emotion recognition

02 Outperforms individual PTMs and homogeneous fusion methods

03 Demonstrates the effectiveness of heterogeneous PTM fusion

Abstract

The emergence of Mamba as an alternative to attention-based architectures has led to the development of Mamba-based self-supervised learning (SSL) pre-trained models (PTMs) for speech and audio processing. Recent studies suggest that these models achieve comparable or superior performance to state-of-the-art (SOTA) attention-based PTMs for speech emotion recognition (SER). Motivated by prior work demonstrating the benefits of PTM fusion across different speech processing tasks, we hypothesize that leveraging the complementary strengths of Mamba-based and attention-based PTMs will enhance SER performance beyond the fusion of homogeneous attention-based PTMs. To this end, we introduce a novel framework, PARROT, that integrates parallel branch fusion with Optimal Transport and Hadamard Product. Our approach achieves SOTA results against individual PTMs, homogeneous PTM fusion, and baseline…
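The two operations named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' architecture: the function names `sinkhorn_plan` and `fuse`, the choice of Sinkhorn iterations for the transport plan, the squared Euclidean cost, the mean pooling, the shared feature dimension, and the way the two branches are concatenated are all assumptions made for illustration. Random arrays stand in for real PTM features.

```python
import numpy as np


def sinkhorn_plan(cost, reg=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn iterations)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)   # uniform mass over Mamba-branch frames
    b = np.full(m, 1.0 / m)   # uniform mass over attention-branch frames
    K = np.exp(-cost / reg)   # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def fuse(x_mamba, x_attn, reg=0.1):
    """Hypothetical two-branch fusion of frame features shaped (T1, d) and (T2, d)."""
    # Branch 1 (OT): squared Euclidean cost between frames, scaled to [0, 1]
    diff = x_mamba[:, None, :] - x_attn[None, :, :]
    cost = (diff ** 2).sum(axis=-1)
    cost = cost / (cost.max() + 1e-12)
    plan = sinkhorn_plan(cost, reg)
    # Barycentric projection: attention frames transported onto the Mamba frames
    x_attn_aligned = (plan / plan.sum(axis=1, keepdims=True)) @ x_attn
    ot_branch = (x_mamba + x_attn_aligned).mean(axis=0)
    # Branch 2 (Hadamard): element-wise product of time-pooled embeddings
    hadamard_branch = x_mamba.mean(axis=0) * x_attn.mean(axis=0)
    # Assumed combination step: concatenate the two parallel branches
    return np.concatenate([ot_branch, hadamard_branch])


rng = np.random.default_rng(0)
x_mamba = rng.normal(size=(5, 8))  # stand-in for Mamba-based PTM features
x_attn = rng.normal(size=(7, 8))   # stand-in for attention-based PTM features
fused = fuse(x_mamba, x_attn)
print(fused.shape)
```

In practice the two PTMs would likely produce features of different dimensionality, so a learned projection to a common size would be needed before either branch; the sketch sidesteps this by assuming both inputs already share dimension `d`.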

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Emotion and Mood Recognition

Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces