ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals

Ameenudeen P E, Charumathi Narayanan, and Sriram Ganapathy

arXiv:2604.06702 · eess.AS · April 9, 2026

TL;DR

ULTRAS introduces a transformer-based framework that jointly learns audio and speech representations by predicting masked spectral patches, improving performance across various speech and audio tasks.

Contribution

The paper presents a unified transformer-based learning framework that encodes both the temporal and the spectral traits of audio and speech signals.

Findings

01. ULTRAS outperforms established baselines on multiple speech and audio tasks.

02. The model effectively encodes spectral and temporal features.

03. Joint learning improves transferability between speech and general audio representations.

Abstract

Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. The predictive modeling of masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode time and frequency traits. Experiments are performed on a variety of speech and audio tasks, where we illustrate that the…
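The masking and combined-loss idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how the spectral and temporal targets are built, how patches are sized, or how the two loss terms are weighted. Everything below is an assumption made for illustration, including the function names (`split_into_patches`, `sample_mask`, `combined_loss`), the patch length, the mask ratio, the use of per-patch averages as targets, and the `weight` parameter. A transformer would normally produce `pred`; here a zero-filled stand-in is used so the example stays self-contained.

```python
import numpy as np


def split_into_patches(logmel, patch_len):
    """Cut a (frames, mels) array into non-overlapping patches along time.

    Trailing frames that do not fill a whole patch are dropped.
    Returns an array of shape (n_patches, patch_len, mels).
    """
    n_patches = logmel.shape[0] // patch_len
    trimmed = logmel[: n_patches * patch_len]
    return trimmed.reshape(n_patches, patch_len, logmel.shape[1])


def sample_mask(n_patches, mask_ratio, rng):
    """Boolean vector marking which patches are masked (at least one)."""
    n_masked = max(1, int(round(mask_ratio * n_patches)))
    mask = np.zeros(n_patches, dtype=bool)
    mask[rng.choice(n_patches, size=n_masked, replace=False)] = True
    return mask


def combined_loss(pred, target, mask, weight=0.5):
    """Weighted sum of two MSE terms, computed on masked patches only.

    ASSUMED targets (hypothetical, not taken from the paper):
      spectral target = patch averaged over time      -> (n_patches, mels)
      temporal target = patch averaged over frequency -> (n_patches, patch_len)
    """
    spec_err = pred.mean(axis=1)[mask] - target.mean(axis=1)[mask]
    temp_err = pred.mean(axis=2)[mask] - target.mean(axis=2)[mask]
    spec_loss = np.mean(spec_err ** 2)
    temp_loss = np.mean(temp_err ** 2)
    return weight * spec_loss + (1.0 - weight) * temp_loss


rng = np.random.default_rng(0)
logmel = rng.standard_normal((100, 80))        # stand-in for 100 frames x 80 mel bins
patches = split_into_patches(logmel, patch_len=20)
mask = sample_mask(len(patches), mask_ratio=0.4, rng=rng)

# Stand-in for the model output: masked patches are predicted as all zeros.
pred = np.where(mask[:, None, None], 0.0, patches)
loss = combined_loss(pred, patches, mask)
print(patches.shape, int(mask.sum()), float(loss))
```

Because both terms are evaluated only where `mask` is true, a predictor cannot lower the loss by copying visible patches; it has to reconstruct both the frequency profile and the time envelope of the hidden ones, which is the intuition behind combining the two targets.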
