Myna: Masking-Based Contrastive Learning of Musical Representations
Ori Yonay, Tracy Hammond, Tianbao Yang

TL;DR
Myna introduces a contrastive learning approach using a Vision Transformer and token masking for self-supervised musical representation learning, achieving state-of-the-art results with high efficiency and pitch sensitivity.
Contribution
The paper presents a novel masking-based data augmentation strategy and a ViT backbone for self-supervised musical representation learning, enabling larger batch sizes and improved performance.
Findings
Outperforms prior models like MULE and rivals larger models like MERT-95M.
Enables training with significantly larger batch sizes (up to 4096).
Achieves state-of-the-art results on musical tasks using only a single GPU.
Abstract
We present Myna, a simple yet effective approach for self-supervised musical representation learning. Built on a contrastive learning framework, Myna introduces two key innovations: (1) the use of a Vision Transformer (ViT) on mel-spectrograms as the backbone and (2) a novel data augmentation strategy, token masking, that masks 90 percent of spectrogram tokens. These innovations deliver both effectiveness and efficiency: (i) Token masking enables a significant increase in per-GPU batch size, from 48 or 120 in prior methods (CLMR, MULE) to 4096. (ii) By avoiding traditional augmentations, Myna retains pitch sensitivity, enhancing performance in tasks like key detection. (iii) The use of vertical patches allows the model to better capture critical features for key detection. Our hybrid model, Myna-22M-Hybrid, processes both 16x16 and 128x2 patches, achieving state-of-the-art results.…
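The token-masking idea in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The 128x96 spectrogram size, the helper names `patchify` and `mask_tokens`, and the choice to draw two independently masked views of the same clip are all assumptions made for illustration; only the 90 percent masking ratio and the 16x16 and 128x2 patch shapes come from the abstract.

```python
import random


def patchify(spec, patch_h, patch_w):
    """Split a 2-D spectrogram (list of rows) into non-overlapping patches.

    Each patch is flattened row-major into one token (a flat list of floats).
    """
    n_mels, n_frames = len(spec), len(spec[0])
    if n_mels % patch_h or n_frames % patch_w:
        raise ValueError("spectrogram size must be divisible by the patch size")
    tokens = []
    for top in range(0, n_mels, patch_h):
        for left in range(0, n_frames, patch_w):
            tokens.append([
                spec[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ])
    return tokens


def mask_tokens(tokens, mask_ratio=0.9, rng=None):
    """Drop `mask_ratio` of the tokens uniformly at random.

    Returns (kept_tokens, kept_indices); the indices are kept so that
    positional information could still be attached downstream.
    """
    if rng is None:
        rng = random.Random()
    n_keep = max(1, round(len(tokens) * (1 - mask_ratio)))
    kept = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept], kept


# Hypothetical input: 128 mel bins x 96 frames (size is an assumption).
N_MELS, N_FRAMES = 128, 96
spec = [[float(r * N_FRAMES + c) for c in range(N_FRAMES)] for r in range(N_MELS)]

rng = random.Random(0)
square = patchify(spec, 16, 16)    # square patches
vertical = patchify(spec, 128, 2)  # full-height "vertical" patches

# Assumed setup: two independently masked views of the same clip.
view_a, idx_a = mask_tokens(square, 0.9, rng)
view_b, idx_b = mask_tokens(square, 0.9, rng)

print(len(square), len(vertical), len(view_a), len(view_b))
```

Under these assumed dimensions, both patch shapes yield 48 tokens of 256 values each, and 90 percent masking leaves 5 tokens per view. An encoder that processes only the kept tokens sees a sequence roughly ten times shorter, which is presumably what makes the much larger per-GPU batch size feasible.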
Peer Reviews
Decision: Submitted to ICLR 2026
The proposed use of a ViT and token masking seems promising for music representation learning. The paper is easy to read, and the illustration of the proposed method, experimental design, and results seems promising.
The proposed method seems to be applicable only to clip-level MIR tasks. I would like to hear the authors' opinion (perhaps as a discussion) on how the proposed architecture could be applied to frame-level tasks as well.
1. The mask-only approach is simple and allows single-GPU large-batch training (batch size 4096), which translates to an 85x increase in efficiency over traditional contrastive methods like CLMR. The model achieves average scores competitive with MERT-95M (68.6 for Myna-Hybrid) and surpasses public baselines like MERT-95M-public and MULE.
2. The hybrid patch design improves key detection (achieving SOTA among self-supervised methods) by integrating frequency-sensitive vertical patches. The met
1. Table 1 mixes public-data and private-data baselines (e.g., MERT-330M) without clearly stating the training resource budgets.
2. The claim that "90% masking performs best" is not strongly supported by Figure 4, for two reasons: (a) performance differences across high masking ratios look marginal and have not been checked for statistical significance; (b) the "average across all four benchmarks" curve is not mathematically rigorous, as it combines different metrics from different t
- Good performance
- Parameter-efficient
- Trained on a public dataset only
- The proposed method is simple
- Limited novelty: some core changes, such as using a ViT and a masked autoencoder, have already been proposed in other similar work, including in the audio domain.
- Although the performance is strong, the margin is modest rather than outstanding.
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Neuroscience and Music Perception
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Layer Normalization · Residual Connection · Dense Connections · Multi-Head Attention · Position-Wise Feed-Forward Layer · Contrastive Learning · Adam
