UniWhisper: Efficient Continual Multi-task Training for Robust Universal Audio Representation
Yuxuan Chen, Peize He, Haoyuan Yu, Junzi Zhang

TL;DR
UniWhisper is an efficient continual multi-task training framework that produces a single universal audio encoder for speech, environmental sound, and music, and that outperforms Whisper on a 20-task evaluation.
Contribution
It presents a novel instruction-based training method for universal audio representation that eliminates task-specific components and demonstrates superior multi-domain performance.
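As a rough illustration of what casting heterogeneous tasks into one instruction-and-answer format can look like, the sketch below maps examples from different tasks onto a single record shape. The task names, the template wording, and the `Example` and `to_instruction_pair` names are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Example:
    task: str        # hypothetical task key, e.g. "asr" or "music_genre"
    audio_path: str  # path to the audio clip
    label: str       # ground-truth answer as plain text


# Hypothetical instruction templates; the paper's actual prompts are not given here.
TEMPLATES = {
    "asr": "Transcribe the speech.",
    "sound_event": "What sound event is present in the audio?",
    "music_genre": "What is the genre of this music?",
}


def to_instruction_pair(ex: Example) -> dict:
    """Map any task's example onto one shared (audio, instruction, answer) record."""
    return {
        "audio": ex.audio_path,
        "instruction": TEMPLATES[ex.task],
        "answer": ex.label,
    }


if __name__ == "__main__":
    batch = [
        Example("asr", "clip_001.wav", "hello world"),
        Example("music_genre", "clip_002.wav", "jazz"),
    ]
    for ex in batch:
        print(to_instruction_pair(ex))
```

Because every record has the same shape, one text decoder can be trained on all of them with a single next-token objective, so no classification head is needed for any individual task.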
Findings
Achieves higher normalized weighted averages than Whisper across 20 tasks: 0.81 vs. 0.64 with MLP probes and 0.61 vs. 0.46 with kNN.
Retains strong speech performance while improving environmental and music task results.
Trains on 38k hours of public audio data.
Abstract
A universal audio representation should capture fine-grained speech cues and high-level semantics for environmental sounds and music in a single encoder. Existing encoders often excel in one domain but degrade in others. We propose UniWhisper, an efficient continual multi-task training framework that casts heterogeneous audio tasks into a unified instruction and answer format. This enables standard next-token training without task-specific heads and losses. We train it on 38k hours of public audio and assess the encoder using shallow MLP probes and k-nearest neighbors (kNN) on 20 tasks spanning speech, environmental sound, and music. UniWhisper reaches normalized weighted averages of 0.81 with MLP probes and 0.61 with kNN, compared to 0.64 and 0.46 for Whisper, while retaining strong speech performance.
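To make the kNN evaluation concrete, here is a minimal sketch of classifying clips from frozen encoder embeddings, with no training of any kind. It assumes embeddings have already been extracted and pooled to one vector per clip. The cosine similarity, the majority vote, the value of `k`, and the function name `knn_predict` are illustrative assumptions, not the paper's stated protocol.

```python
import numpy as np


def knn_predict(train_x, train_y, test_x, k=5):
    """Predict labels for test_x by majority vote over the k nearest training embeddings.

    train_x: (n_train, d) float array of frozen embeddings
    train_y: (n_train,) array of non-negative integer class labels
    test_x:  (n_test, d) float array of frozen embeddings
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors
    # L2-normalize so that a dot product equals cosine similarity.
    train_n = train_x / (np.linalg.norm(train_x, axis=1, keepdims=True) + eps)
    test_n = test_x / (np.linalg.norm(test_x, axis=1, keepdims=True) + eps)
    sims = test_n @ train_n.T  # (n_test, n_train) similarity matrix
    # Indices of the k most similar training points for each test point.
    nn_idx = np.argsort(-sims, axis=1)[:, :k]
    nn_labels = train_y[nn_idx]  # (n_test, k) neighbor labels
    # Majority vote per row; ties resolve to the smallest label.
    return np.array([np.bincount(row).argmax() for row in nn_labels])


if __name__ == "__main__":
    train_x = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
    train_y = np.array([0, 0, 1, 1])
    test_x = np.array([[1.0, 0.05], [0.05, 1.0]])
    print(knn_predict(train_x, train_y, test_x, k=3))
```

Since kNN adds no learned parameters, its score reflects how well the encoder's embedding space already separates classes. A shallow MLP probe measures instead how much a small trained classifier can recover from the same embeddings.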
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
