TL;DR
SPEAR introduces a unified self-supervised learning framework that effectively combines speech and general audio representations into a single model, outperforming existing models on multiple benchmarks.
Contribution
SPEAR is the first framework to distill knowledge from separate speech and audio teachers into one unified model using multi-codebook vector quantisation and novel training strategies.
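To make the idea of multi-codebook vector quantisation concrete, here is a minimal sketch of how one continuous teacher frame could be turned into one discrete token per codebook. It is an illustration under stated assumptions, not SPEAR's quantiser. The codebooks are random rather than trained, and the encoding is a greedy residual search used as a simple stand-in. The names `nearest`, `quantise`, `reconstruct` and all sizes are hypothetical.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, entry in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(entry, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def quantise(codebooks, vec):
    """Greedy residual encoding (a stand-in): pick one index per codebook."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Subtract the chosen entry so the next codebook models what is left.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def reconstruct(codebooks, indices):
    """Sum the selected entries to approximate the original vector."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy sizes, chosen only for illustration (assumption, not from the paper).
DIM, NUM_CODEBOOKS, CODEBOOK_SIZE = 4, 3, 8
rng = random.Random(0)
codebooks = [
    [[rng.gauss(0.0, 1.0 / (k + 1)) for _ in range(DIM)]
     for _ in range(CODEBOOK_SIZE)]
    for k in range(NUM_CODEBOOKS)
]
frame = [rng.gauss(0.0, 1.0) for _ in range(DIM)]  # stands in for a teacher frame
tokens = quantise(codebooks, frame)                # one discrete token per codebook
approx = reconstruct(codebooks, tokens)
```

Each frame yields `NUM_CODEBOOKS` tokens, so the student has several classification targets per frame instead of a single cluster ID, which is what makes the supervision fine-grained.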
Findings
- Outperforms existing models on the SUPERB benchmark
- Achieves state-of-the-art on 12 of 15 SUPERB tasks
- Shows competitive performance on the HEAR benchmark
Abstract
Self-supervised learning (SSL) has significantly advanced acoustic representation learning. However, most existing models are optimised for either speech or audio event understanding, resulting in a persistent gap between these two domains. We address this gap with SPEAR (SPEech and Audio Representations), a self-supervised framework that distils complementary knowledge from a speech-focused SSL teacher and a general-audio SSL teacher into a single unified model. SPEAR applies multi-codebook vector quantisation to continuous teacher representations to produce fine-grained discrete tokens that capture both semantic and acoustic information. To effectively integrate these heterogeneous representations, SPEAR jointly predicts them given a masked input with an asymmetric pre-training loss. We further improve robustness in complex sound scenes through a novel token mixing mechanism.…
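The joint prediction objective described in the abstract can be sketched as a masked cross-entropy over both teachers' tokens. The text above does not say what makes the loss asymmetric. This sketch assumes it means unequal weights on the speech and audio targets, and that only masked frames contribute. Both are assumptions, and the function names and default weights are hypothetical.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target]


def codebook_loss(frame_logits, frame_targets, mask):
    """Mean cross-entropy over masked frames and codebooks.

    frame_logits[t][k] holds the logits for codebook k at frame t.
    frame_targets[t][k] holds the teacher token index for that codebook.
    mask[t] is True where the input frame was masked.
    """
    total, count = 0.0, 0
    for logits_t, targets_t, masked in zip(frame_logits, frame_targets, mask):
        if not masked:
            continue  # assumption: unmasked frames are not scored
        for logits, target in zip(logits_t, targets_t):
            total += cross_entropy(logits, target)
            count += 1
    return total / count if count else 0.0


def joint_loss(speech_logits, speech_targets, audio_logits, audio_targets,
               mask, speech_weight=1.0, audio_weight=0.5):
    """Weighted sum of the two per-teacher losses.

    The unequal default weights are illustrative placeholders only.
    """
    return (speech_weight * codebook_loss(speech_logits, speech_targets, mask)
            + audio_weight * codebook_loss(audio_logits, audio_targets, mask))


# Two frames, one codebook per teacher, vocabulary of 4 tokens, uniform logits.
uniform = [[[0.0, 0.0, 0.0, 0.0]], [[0.0, 0.0, 0.0, 0.0]]]
targets = [[2], [1]]
mask = [True, False]
loss = joint_loss(uniform, targets, uniform, targets, mask)
```

With uniform logits each masked prediction costs `log(4)`, so the toy value of `loss` is `1.5 * log(4)` under the placeholder weights.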
Peer Reviews
Decision: Submitted to ICLR 2026
The paper is well written and contains detailed experiments. The experimental analysis is comprehensive. The student model achieves strong performance on many tasks, even outperforming teacher models such as WavLM on some of them.
The paper has limited novelty. Multi-codebook vector quantization was proposed for knowledge distillation, and distilling from modality-specific teachers to build a unified model has been done before, e.g., USAD.
The article is very clear and shows an interesting way of merging speech and audio representations. The structure is linear, and the experiments compare the model to existing models in great detail on established benchmarks.
A few weak points remain. The article proposes several sources of improvement at once: combining speech and audio teacher models, using MVQ representations to jointly represent them, and using WavLM and Dasheng as teachers, on an architecture different from the wav2vec 2.0 framework (namely filterbanks + Zipformer). First: one or more ablation studies would have helped to judge which aspects of the pipeline have the greatest impact. Second: One
1. **Originality.** Uses MVQ tokens from teacher models to supply fine-grained discrete supervision for both speech and audio, contrasting with k-means/RPQ in prior speech SSL, and introduces an asymmetrical dual-domain loss. 2. **Quality.** Thorough empirical study: LibriSpeech ASR (RNN-T and CTC), AudioSet tagging, SUPERB, HEAR; dual-domain gains and scaling trends; targeted ablations (teacher, codebooks, dual-domain strategy). 3. **Clarity.** MVQ formulation and encoding/decoding are clearl
1. **Compute transparency.** The paper lists steps and batch sizes but provides no hardware or GPU hours accounting for pre-training and fine-tuning. For a 600M dual-domain model on up to 197k hours, the compute cost should be reported. 2. **Attribution of gains.** While Appendix G ablates several factors, the main text does not clearly isolate the source of gains (teacher strength vs. MVQ vs. Zipformer vs. data scale). A controlled comparison using the same encoder and data but different quant
Code & Models
- 🤗 marcoyang/spear-xlarge-speech-audio (model): 53k downloads, 4 likes
- 🤗 marcoyang/spear-large-speech (model): 14 downloads
- 🤗 marcoyang/spear-large-speech-audio (model): 141 downloads
- 🤗 marcoyang/spear-base-speech (model): 38 downloads
- 🤗 marcoyang/spear-base-speech-audio (model): 6 downloads, 2 likes
- 🤗 marcoyang/spear-base-speech-audio-v2 (model): 92 downloads
- 🤗 marcoyang/spear-base-speech-v2 (model): 133 downloads
