Descriptor-Injected Cross-Modal Learning: A Systematic Exploration of Audio-MIDI Alignment via Spectral and Melodic Features
Mariano Fernández Méndez

TL;DR
This paper systematically explores how hand-crafted spectral and melodic features can improve cross-modal audio-MIDI retrieval. It reports a gain of 8.8 percentage points over a descriptor-free baseline and analyzes which features matter and how they affect model alignment.
Contribution
It introduces descriptor injection into cross-modal models, evaluates various configurations, and proposes reverse cross-attention to enhance audio-MIDI alignment.
Findings
The best configuration reaches a mean S of 84.0% across five independent seeds, 8.8 percentage points above the descriptor-free baseline.
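Causal ablation indicates that the audio descriptor A4 (octave-band energy dynamics) drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics.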
Descriptors increase transformer layer alignment, indicating better representational convergence.
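The findings name A4 only as a descriptor based on octave-band energy dynamics; its exact definition is not given on this page. As one plausible reading, and purely as an assumption, the sketch below sums spectral power into octave-wide bands and takes the frame-to-frame change in log energy. The function names, the 62.5 Hz starting frequency, and the band count are all hypothetical choices, not the paper's.

```python
import math

def octave_band_edges(f_min=62.5, n_bands=8):
    """Return (low, high) edges in Hz for octave-wide bands starting at f_min.

    f_min and n_bands are illustrative defaults, not values from the paper.
    """
    return [(f_min * 2 ** k, f_min * 2 ** (k + 1)) for k in range(n_bands)]

def octave_band_energy(power_frame, bin_freqs, edges):
    """Sum the power of all bins whose frequency falls in each [low, high) band."""
    return [
        sum(p for p, f in zip(power_frame, bin_freqs) if low <= f < high)
        for low, high in edges
    ]

def band_energy_dynamics(power_frames, bin_freqs, edges, eps=1e-10):
    """Frame-to-frame difference of log band energy.

    Returns one vector (one value per band) for each pair of consecutive frames.
    eps avoids log(0) for empty or silent bands.
    """
    log_e = [
        [math.log(e + eps) for e in octave_band_energy(frame, bin_freqs, edges)]
        for frame in power_frames
    ]
    return [[cur - prev for prev, cur in zip(a, b)] for a, b in zip(log_e, log_e[1:])]

# Toy input: two frames, four bins, one bin per octave band.
bin_freqs = [100.0, 200.0, 400.0, 800.0]
edges = octave_band_edges(f_min=62.5, n_bands=4)
frames = [[1.0, 1.0, 1.0, 1.0], [2.0, 1.0, 1.0, 1.0]]
dynamics = band_energy_dynamics(frames, bin_freqs, edges)
```

In the toy input only the lowest band doubles its energy between frames, so only the first entry of the single resulting vector is non-zero (about log 2). In practice the power frames would come from a short-time Fourier transform of the audio.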
Abstract
Cross-modal retrieval between audio recordings and symbolic music representations (MIDI) remains challenging because continuous waveforms and discrete event sequences encode different aspects of the same performance. We study descriptor injection, the augmentation of modality-specific encoders with hand-crafted domain features, as a bridge across this gap. In a three-phase campaign covering 13 descriptor-mechanism combinations, 6 architectural families, and 3 training schedules, the best configuration reaches a mean S of 84.0 percent across five independent seeds, improving the descriptor-free baseline by 8.8 percentage points. Causal ablation shows that the audio descriptor A4, based on octave-band energy dynamics, drives the gain in the top dual models, while the MIDI descriptor D4 has only a weak inference-time effect despite improving training dynamics. We also introduce reverse…
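The gain above is stated in percentage points, which differs from a relative percentage. A minimal sketch of the arithmetic, using only the two aggregate figures reported in the abstract; the function names are illustrative and do not come from the paper:

```python
def implied_baseline(best_pct: float, gain_pp: float) -> float:
    """Baseline score in percent implied by a best score and a gain in percentage points."""
    return best_pct - gain_pp

def relative_gain_pct(best_pct: float, gain_pp: float) -> float:
    """The same gain expressed as a percentage of the implied baseline."""
    return 100.0 * gain_pp / implied_baseline(best_pct, gain_pp)

BEST_MEAN_S = 84.0  # mean S of the best configuration, in percent (from the abstract)
GAIN_PP = 8.8       # improvement over the descriptor-free baseline, in percentage points

baseline = implied_baseline(BEST_MEAN_S, GAIN_PP)
relative = relative_gain_pct(BEST_MEAN_S, GAIN_PP)

print(f"implied baseline: {baseline:.1f}%")  # implied baseline: 75.2%
print(f"relative gain:    {relative:.1f}%")  # relative gain:    11.7%
```

Under these figures the descriptor-free baseline sits at about 75.2%, and the 8.8-point improvement corresponds to a relative gain of roughly 11.7%.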
