Attention as a Perspective for Learning Tempo-invariant Audio Queries
Matthias Dorfer, Jan Hajič Jr., Gerhard Widmer

TL;DR
This paper introduces a soft attention mechanism in audio–sheet music retrieval models to improve tempo invariance, enabling more accurate retrieval across performances with varying tempos.
Contribution
It proposes a soft attention mechanism that addresses tempo variability in audio queries: rather than encoding a whole fixed-size audio window, the model encodes only the parts of the excerpt most relevant to the query.
Findings
Soft attention improves retrieval performance in experiments on classical piano music.
The attention mechanism exhibits intuitively appealing behavior.
Abstract
Current models for audio–sheet music retrieval via multimodal embedding space learning use convolutional neural networks with a fixed-size window for the input audio. Depending on the tempo of a query performance, this window captures more or less musical content, while notehead density in the score is largely tempo-independent. In this work we address this disparity with a soft attention mechanism, which allows the model to encode only those parts of an audio excerpt that are most relevant with respect to efficient query codes. Empirical results on classical piano music indicate that attention is beneficial for retrieval performance and exhibits intuitively appealing behavior.
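To make the idea of soft attention over a fixed-size audio window concrete, here is a minimal sketch in plain Python. It is a generic illustration, not the authors' implementation: the function names `softmax` and `attend`, the toy frame values, and the hand-picked relevance scores are all assumptions. In a trained model the scores would be predicted by the network rather than supplied by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(frames, scores):
    """Scale each time frame by its soft attention weight.

    frames: list of T feature vectors, one per audio frame.
    scores: list of T unnormalized relevance scores (hypothetical here;
            a real model would predict them from the audio).
    Returns (weights, weighted_frames).
    """
    weights = softmax(scores)
    weighted = [[w * x for x in frame] for w, frame in zip(weights, frames)]
    return weights, weighted

# Toy excerpt: 6 frames with 2 frequency bins each (made-up numbers).
frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0],
          [1.0, 1.0], [0.2, 0.8], [0.9, 0.1]]
# High scores on the middle frames mimic focusing on a shorter span
# of a long window; low scores suppress the edges.
scores = [-4.0, 0.0, 2.0, 2.0, 0.0, -4.0]

weights, weighted = attend(frames, scores)
```

Because the weights sum to one and vary smoothly with the scores, the model can narrow or widen its effective window per query instead of committing to one fixed span, which is the property the abstract relies on for handling tempo differences.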
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
