AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers
Kohei Yamamoto, Kosuke Okusa

TL;DR
AaSP introduces an aliasing-aware pre-training framework for audio spectrogram transformers, improving stability and performance by adaptively analyzing subbands and integrating features from alias-prone modulation bands.
Contribution
The paper proposes AaSP, a novel aliasing-aware self-supervised learning method that enhances audio spectrogram transformer representations through adaptive subband analysis and aliasing mitigation.
Findings
Achieves state-of-the-art results on AS-20K, ESC-50, and NSynth benchmarks.
Learns more stable representations under aliasing-sensitive temporal perturbations.
Shows competitive performance on various audio recognition tasks.
Abstract
Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naive low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands…
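The aliasing problem the abstract describes can be illustrated with a small numerical sketch. This is not the authors' code or the AaPE module; it only shows the general signal-processing effect of strided temporal downsampling without an anti-aliasing filter. The frame rate, stride, and modulation frequency are hypothetical values chosen so the arithmetic is easy to follow, and `dominant_frequency` is a helper invented for this illustration.

```python
import numpy as np

def dominant_frequency(x, rate):
    """Return the frequency (Hz) of the largest FFT peak after removing the mean."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / rate)
    return freqs[np.argmax(spectrum)]

# Hypothetical settings (assumptions, not taken from the paper).
frame_rate = 100.0  # spectrogram frames per second
stride = 4          # temporal stride of a patch-embedding layer
mod_freq = 30.0     # temporal modulation frequency in Hz

# A single frequency-bin envelope that fluctuates at mod_freq.
t = np.arange(400) / frame_rate
envelope = np.cos(2.0 * np.pi * mod_freq * t)

# Strided sampling with no low-pass filter, as a stand-in for
# temporal downsampling inside patchification.
decimated = envelope[::stride]
new_rate = frame_rate / stride  # 25 Hz, so the new Nyquist limit is 12.5 Hz

original_peak = dominant_frequency(envelope, frame_rate)
aliased_peak = dominant_frequency(decimated, new_rate)

print(f"peak before downsampling: {original_peak:.2f} Hz")
print(f"peak after downsampling:  {aliased_peak:.2f} Hz")
```

With these values the 30 Hz modulation lies above the post-stride Nyquist limit of 12.5 Hz, so it folds to |30 - 25| = 5 Hz and becomes indistinguishable from a genuine 5 Hz modulation. A low-pass filter applied before the stride would prevent the folding but would also discard the 30 Hz component, which is the trade-off the abstract points to.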
