Representation-Regularized Convolutional Audio Transformer for Audio Understanding

Bing Han; Chushu Zhou; Yifan Yang; Wei Wang; Chenda Li; Wangyou Zhang; Yanmin Qian

arXiv:2601.21612 · eess.AS · January 30, 2026

TL;DR

This paper introduces the Convolutional Audio Transformer (CAT), a hierarchical self-supervised learning (SSL) framework that captures multi-resolution audio features and aligns its representations with those of pre-trained encoders, improving both training efficiency and audio understanding performance.

Contribution

The paper proposes CAT, a hierarchical audio transformer that combines a multi-resolution block with a representation regularization objective to improve training efficiency and the modeling of complex audio signals.
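
As a rough illustration of aggregating information across granularities, the sketch below pools a 1-D feature sequence at several temporal strides, brings each view back to the original length, and averages them. It is only a toy stand-in under stated assumptions: this page does not specify the internals of CAT's multi-resolution block, and the function names, the strides, and the use of average pooling in place of learned convolutions are all hypothetical.

```python
def avg_pool(seq, stride):
    """Average non-overlapping windows of `stride` frames (the last window may be shorter)."""
    windows = [seq[i:i + stride] for i in range(0, len(seq), stride)]
    return [sum(w) / len(w) for w in windows]


def upsample(seq, stride, length):
    """Repeat each pooled value `stride` times, then trim to `length` frames."""
    out = []
    for value in seq:
        out.extend([value] * stride)
    return out[:length]


def multi_resolution_features(seq, strides=(1, 2, 4)):
    """Average the input as seen at several temporal granularities.

    Assumption: plain averaging replaces whatever learned fusion the real
    block uses; the strides are illustrative, not taken from the paper.
    """
    views = [upsample(avg_pool(seq, s), s, len(seq)) for s in strides]
    return [sum(vals) / len(views) for vals in zip(*views)]
```

Calling `multi_resolution_features([0.0, 2.0, 4.0, 6.0], strides=(1, 2))` blends the frame-level sequence with its two-frame averages, so each output frame carries both fine and coarse context.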

Findings

1. Outperforms baselines on audio benchmarks
2. Achieves 5x faster convergence on AudioSet 20k
3. Effectively models diverse temporal and spectral structures

Abstract

Bootstrap-based Self-Supervised Learning (SSL) has achieved remarkable progress in audio understanding. However, existing methods typically operate at a single level of granularity, limiting their ability to model the diverse temporal and spectral structures inherent in complex audio signals. Furthermore, bootstrapping representations from scratch is computationally expensive, often requiring extensive training to converge. In this work, we propose the Convolutional Audio Transformer (CAT), a unified framework designed to address these challenges. First, to capture hierarchical audio features, CAT incorporates a Multi-resolution Block that aggregates information across varying granularities. Second, to enhance training efficiency, we introduce a Representation Regularization objective. Drawing inspiration from generative modeling, this auxiliary task guides the student model by aligning…
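
The idea of guiding the student by aligning its representations with those of a pre-trained encoder can be sketched as an auxiliary loss added to the bootstrap objective. The example below is a minimal sketch under assumptions, not the paper's formulation: the cosine-distance form, the `weight` value, and all function names are hypothetical, and the student frames are assumed to have already been projected to the same dimensionality as the frozen encoder's frames.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # small epsilon avoids division by zero


def alignment_loss(student_frames, encoder_frames):
    """Mean cosine distance between student frames and frozen-encoder frames.

    Assumption: frame-wise cosine distance is one common alignment choice;
    the paper's actual regularizer may differ.
    """
    sims = [cosine_similarity(s, t) for s, t in zip(student_frames, encoder_frames)]
    return 1.0 - sum(sims) / len(sims)


def total_loss(bootstrap_loss, student_frames, encoder_frames, weight=0.5):
    """Bootstrap objective plus the weighted auxiliary alignment term.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return bootstrap_loss + weight * alignment_loss(student_frames, encoder_frames)
```

In this sketch the pre-trained encoder only supplies targets and receives no gradient, so the auxiliary term costs one extra forward pass through a frozen model per batch.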

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing