PHALAR: Phasors for Learned Musical Audio Representations
Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

TL;DR
PHALAR is a contrastive framework for musical stem retrieval that combines learned spectral pooling with a complex-valued head, improving retrieval accuracy over prior models while retaining the temporal information they discard.
Contribution
The paper introduces PHALAR, a contrastive model built on a Learned Spectral Pooling layer and a complex-valued head, which achieves state-of-the-art results in musical stem retrieval and captures musical structure beyond that task.
Findings
Achieves a relative accuracy increase of up to 70% over the previous state of the art in stem retrieval.
Requires less than 50% of the parameters and trains 7 times faster.
Correlates more strongly with human coherence judgments than semantic baselines, and captures musical structure in zero-shot beat tracking and linear chord probing.
Abstract
Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to 70% over the state-of-the-art while requiring less than 50% of the parameters and a 7× training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes a new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, and correlates significantly more strongly with human coherence judgments than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.
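To make the phase-equivariance idea concrete, here is a minimal sketch. It is not the authors' implementation: a fixed, truncated DFT along time stands in for the Learned Spectral Pooling layer, and the magnitude of a complex inner product stands in for whatever similarity the complex-valued head uses. All function names and the toy signals are hypothetical. The sketch only illustrates a standard Fourier property: a circular time shift of s steps multiplies DFT bin k by exp(-2πi·k·s/N), so the pooled complex features keep timing information as phase, while a similarity based on the complex inner product is unchanged when both inputs are shifted together.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def spectral_pool(x, keep):
    """Keep the first `keep` DFT bins along time (fixed here, not learned)."""
    return dft(x)[:keep]

def circular_shift(x, s):
    """Delay the sequence by s steps, wrapping around."""
    n = len(x)
    return [x[(t - s) % n] for t in range(n)]

def hermitian_similarity(a, b):
    """Magnitude of the complex inner product between two pooled vectors."""
    return abs(sum(ai * bi.conjugate() for ai, bi in zip(a, b)))

# Hypothetical per-frame activations for two stems over N = 16 time steps.
N, KEEP, SHIFT = 16, 5, 3
stem_a = [math.sin(2 * math.pi * 2 * t / N) + 0.5 * math.cos(2 * math.pi * 3 * t / N)
          for t in range(N)]
stem_b = [math.cos(2 * math.pi * 2 * t / N) + 0.25 * math.sin(2 * math.pi * 3 * t / N)
          for t in range(N)]

pooled_a = spectral_pool(stem_a, KEEP)
pooled_b = spectral_pool(stem_b, KEEP)
pooled_a_shifted = spectral_pool(circular_shift(stem_a, SHIFT), KEEP)
pooled_b_shifted = spectral_pool(circular_shift(stem_b, SHIFT), KEEP)

# Equivariance: shifting in time rotates the phase of bin k by a known angle.
predicted = [c * cmath.exp(-2j * math.pi * k * SHIFT / N)
             for k, c in enumerate(pooled_a)]
equivariance_error = max(abs(p - q) for p, q in zip(predicted, pooled_a_shifted))

# Shifting both stems together leaves the similarity unchanged.
sim_original = hermitian_similarity(pooled_a, pooled_b)
sim_joint_shift = hermitian_similarity(pooled_a_shifted, pooled_b_shifted)

# Shifting only one stem changes the relative phases across bins, so this
# value can differ from sim_original: relative timing remains visible.
sim_relative_shift = hermitian_similarity(pooled_a, pooled_b_shifted)

print(equivariance_error, sim_original, sim_joint_shift, sim_relative_shift)
```

Magnitude-only pooling, by contrast, would give identical features for a signal and its shifted copy, which is one way a model can end up discarding temporal information. The pitch-equivariant bias mentioned in the abstract is not covered by this sketch.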
