PHALAR: Phasors for Learned Musical Audio Representations

Davide Marincione; Michele Mancusi; Giorgio Strano; Luca Cerovaz; Donato Crisostomi; Roberto Ribuoli; Emanuele Rodolà

arXiv:2605.03929·cs.SD·May 12, 2026

TL;DR

PHALAR is a contrastive framework that improves musical stem retrieval accuracy by combining a Learned Spectral Pooling layer with a complex-valued head, and its representations also capture musical structure such as beats and chords.

Contribution

The paper introduces PHALAR, a model with a Learned Spectral Pooling layer and a complex-valued head that sets a new state of the art in musical stem retrieval on MoisesDB, Slakh, and ChocoChorales.

Findings

01

Achieves a relative accuracy increase of up to approximately 70% over the previous state of the art.

02

Requires less than 50% of the parameters and trains 7 times faster.

03

Correlates significantly better with human coherence judgments than semantic baselines, and captures musical structure, as shown by zero-shot beat tracking and linear chord probing.

Abstract

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework that achieves a relative accuracy increase of up to $\approx 70\%$ over the state of the art while requiring $< 50\%$ of the parameters and delivering a $7\times$ training speedup. By using a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new retrieval state of the art across MoisesDB, Slakh, and ChocoChorales, and correlates significantly better with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structure beyond the retrieval task.
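
To make the retrieval setting concrete, here is a minimal sketch of how ranking candidate stems against a submix might look when embeddings are complex-valued. It is not the paper's architecture. The abstract does not specify the scoring function, so the function name `retrieve_stem`, the random embeddings, and the choice of score (the magnitude of the Hermitian inner product between unit-norm vectors) are all assumptions made for illustration. The sketch also shows one simple property of such a score: rotating the query by a global phase leaves the ranking unchanged.

```python
import numpy as np

def retrieve_stem(query, candidates):
    """Rank candidate stem embeddings against a submix (query) embedding.

    query:      complex array of shape (d,)
    candidates: complex array of shape (n, d)
    The score is an ASSUMED similarity: the magnitude of the Hermitian
    inner product between unit-norm complex vectors.
    Returns (index of best candidate, array of n scores).
    """
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    scores = np.abs(c @ np.conj(q))
    return int(np.argmax(scores)), scores

# Hypothetical data: random complex embeddings stand in for model outputs.
rng = np.random.default_rng(0)
d, n = 16, 5
candidates = rng.normal(size=(n, d)) + 1j * rng.normal(size=(n, d))

# Pretend the submix embedding is close to candidate 3 (small added noise).
noise = rng.normal(size=d) + 1j * rng.normal(size=d)
query = candidates[3] + 0.05 * noise

best, scores = retrieve_stem(query, candidates)

# Apply a global phase rotation to the query; the magnitudes do not change.
rotated_query = np.exp(1j * 0.7) * query
best_rot, scores_rot = retrieve_stem(rotated_query, candidates)

print("best candidate:", best)
print("same ranking after phase rotation:", best_rot == best)
```

Taking the magnitude discards the relative phase between the two embeddings. A model that wants to use phase as a signal, for example to encode timing, would need a score that keeps it, such as the real part of the inner product.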
