ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

Junyu Wang; Tianrui Wang; Meng Ge; Longbiao Wang; Jianwu Dang

arXiv:2507.02666 · cs.SD · July 4, 2025

TL;DR

This paper introduces ASDA, a differential attention mechanism for self-supervised audio representation learning. It reduces the attention weight spent on irrelevant information and reports state-of-the-art results on audio classification, keyword spotting, and environmental sound classification benchmarks.

Contribution

The paper proposes a novel differential attention mechanism with dual-softmax and differential coefficients, enhancing the discriminative ability of Transformer-based models in audio tasks.
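
The idea of subtracting one softmax attention map from another can be illustrated with a minimal NumPy sketch. It assumes the general differential-attention form `softmax(Q1 K1^T) - lam * softmax(Q2 K2^T)` and is not taken from the paper: the function name `differential_attention`, the weight shapes, and the fixed scalar coefficient `lam` are all hypothetical choices for illustration, and ASDA's actual parameterization of its differential coefficients may differ.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def differential_attention(x, w_q, w_k, w_v, lam):
    """Single-head differential attention (illustrative sketch only).

    x:        (n_tokens, d_model) input tokens, e.g. spectrogram patches
    w_q, w_k: (d_model, 2 * d_head) projections, split into two halves
    w_v:      (d_model, d_value) value projection
    lam:      scalar differential coefficient (assumed fixed here)
    """
    q1, q2 = np.split(x @ w_q, 2, axis=-1)
    k1, k2 = np.split(x @ w_k, 2, axis=-1)
    v = x @ w_v
    scale = 1.0 / np.sqrt(q1.shape[-1])

    # Two independent softmax attention maps (the "dual softmax").
    a1 = softmax(q1 @ k1.T * scale)
    a2 = softmax(q2 @ k2.T * scale)

    # Subtracting the second map cancels attention mass both maps share.
    weights = a1 - lam * a2
    return weights @ v, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_head, d_value = 6, 16, 4, 8
x = rng.normal(size=(n_tokens, d_model))
w_q = rng.normal(size=(d_model, 2 * d_head))
w_k = rng.normal(size=(d_model, 2 * d_head))
w_v = rng.normal(size=(d_model, d_value))

out, weights = differential_attention(x, w_q, w_k, w_v, lam=0.5)
```

In this sketch each row of `weights` sums to `1 - lam` rather than 1, and individual entries can be negative. Setting `lam=0` recovers ordinary softmax attention on the first query/key half.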

Findings

01. Achieves SOTA performance on multiple audio benchmarks

02. Effectively reduces irrelevant attention in Transformer models

03. Improves discriminative ability in self-supervised audio learning

Abstract

In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in…
