SPEAR: A Unified SSL Framework for Learning Speech and Audio Representations

Xiaoyu Yang; Yifan Yang; Zengrui Jin; Ziyun Cui; Wen Wu; Baoxiang Li; Chao Zhang; Phil Woodland

arXiv:2510.25955·eess.AS·March 5, 2026

TL;DR

SPEAR introduces a unified self-supervised learning framework that effectively combines speech and general audio representations into a single model, outperforming existing models on multiple benchmarks.

Contribution

SPEAR is the first framework to distill knowledge from separate speech and audio teachers into one unified model using multi-codebook vector quantisation and novel training strategies.

Findings

01 · Outperforms existing models on the SUPERB benchmark

02 · Achieves state-of-the-art results on 12 of 15 SUPERB tasks

03 · Shows competitive performance on the HEAR benchmark

Abstract

Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism.…
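To make the idea of turning a continuous teacher frame into several discrete tokens concrete, here is a minimal sketch. It is not the paper's MVQ method: it assumes a simple greedy residual quantiser with fixed, hand-written codebooks, and the names `nearest`, `encode`, `decode` and the toy codebook values are all hypothetical. It only illustrates that each frame yields one token per codebook, and that the tokens can be decoded back to an approximation of the frame.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = Sequence[Vector]  # one codebook = a list of codeword vectors


def nearest(codebook: Codebook, x: Vector) -> int:
    """Return the index of the codeword closest to x (squared Euclidean distance)."""
    def sq_dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))


def encode(codebooks: Sequence[Codebook], frame: Vector) -> List[int]:
    """Greedy residual encoding (an assumption, not the paper's MVQ):
    each codebook quantises what the previous codebooks left unexplained,
    producing one discrete token per codebook."""
    residual = list(frame)
    tokens: List[int] = []
    for cb in codebooks:
        k = nearest(cb, residual)
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return tokens


def decode(codebooks: Sequence[Codebook], tokens: Sequence[int]) -> List[float]:
    """Reconstruct an approximate frame by summing the selected codewords."""
    out = [0.0] * len(codebooks[0][0])
    for cb, k in zip(codebooks, tokens):
        out = [o + c for o, c in zip(out, cb[k])]
    return out


# Toy, hand-written codebooks (hypothetical values): one coarse, one fine.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25]],
]
frame = [1.2, 1.0]                 # stands in for one teacher embedding frame
tokens = encode(codebooks, frame)  # one token index per codebook
recon = decode(codebooks, tokens)  # approximate reconstruction of the frame
```

In a distillation setup of the kind the abstract describes, per-codebook token indices like these would serve as classification targets for the student at masked positions. How SPEAR trains its codebooks and weights the speech and audio targets is not specified in this excerpt.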

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper is written well and contains detailed experiments. The experimental analysis is comprehensive. The student model achieves strong performance on many tasks, even outperforming the teacher model, such as WavLM, on some tasks.

Weaknesses

The paper has limited novelty. Multi-codebook vector quantization was proposed for knowledge distillation, and distilling from modality-specific teachers to build a unified model has been done before, e.g., USAD.

Reviewer 02 · Rating 8 · Confidence 4

Strengths

The article is very clear and shows an interesting way of merging speech and audio representations. The structure is linear, and the experiments compare the approach to existing models in great detail on well-known benchmarks.

Weaknesses

A few weak points can still be identified. The article proposes multiple sources of improvement: the combination of speech and audio teacher models, the use of MVQ representations to jointly represent them, and the use of WavLM and Dasheng as teachers, on an architecture different from the Wav2Vec2.0 framework (namely filterbanks + Zipformer). First: One or more ablation studies would have been useful to better judge which aspects of the pipeline have the greatest impact. Second: One

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. **Originality.** Uses MVQ tokens from teacher models to supply fine-grained discrete supervision for both speech and audio, contrasting with k-means/RPQ in prior speech SSL, and introduces an asymmetrical dual-domain loss.
2. **Quality.** Thorough empirical study: LibriSpeech ASR (RNN-T and CTC), AudioSet tagging, SUPERB, HEAR; dual-domain gains and scaling trends; targeted ablations (teacher, codebooks, dual-domain strategy).
3. **Clarity.** MVQ formulation and encoding/decoding are clearl

Weaknesses

1. **Compute transparency.** The paper lists steps and batch sizes but provides no hardware or GPU hours accounting for pre-training and fine-tuning. For a 600M dual-domain model on up to 197k hours, the compute cost should be reported.
2. **Attribution of gains.** While Appendix G ablates several factors, the main text does not clearly isolate the source of gains (teacher strength vs. MVQ vs. Zipformer vs. data scale). A controlled comparison using the same encoder and data but different quant
