Listening to features
Manuel Moussallam, Antoine Liutkus, Laurent Daudet

TL;DR
This paper introduces a blind, exemplar-based method for synthesizing audio from low-dimensional features, addressing challenges like irregular temporal spacing and unknown feature computation methods, demonstrated on speech and song datasets.
Contribution
It proposes a novel, flexible synthesis framework that does not rely on explicit feature inversion formulas, suitable for complex datasets like the Million Song Dataset.
Findings
Successful synthesis of speech from known features.
Application to inverting songs from the Million Song Dataset.
Framework handles irregular temporal features and black-box feature computation.
Abstract
This work explores nonparametric methods that aim at synthesizing audio from the low-dimensional acoustic features typically used in MIR frameworks. Several issues prevent this task from being achieved straightforwardly. Such features are designed for analysis and not for synthesis, thus favoring high-level description over easily inverted acoustic representations. Whereas some previous studies have already considered the problem of synthesizing audio from features such as Mel-Frequency Cepstral Coefficients, they mainly relied on the explicit formula used to compute those features in order to invert them. Here, we instead adopt a simple blind approach, where arbitrary sets of features can be used during synthesis and where reconstruction is exemplar-based. After testing the approach on the problem of synthesizing speech from well-known features, we apply it to the more complex task of inverting songs from…
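To make the idea of blind, exemplar-based reconstruction concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes regularly spaced frames (the paper also targets irregular spacing), a plain Euclidean nearest-neighbor selection rule, and a hypothetical `toy_features` function standing in for the black-box feature extractor. All function names are illustrative.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one row per frame)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

def toy_features(frames):
    """Hypothetical stand-in for a black-box feature extractor:
    log energy in 4 coarse bands of the windowed magnitude spectrum."""
    spec = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    bands = np.array_split(spec, 4, axis=1)
    return np.log1p(np.stack([b.sum(axis=1) for b in bands], axis=1))

def exemplar_synthesis(target_feats, corpus_feats, corpus_frames, hop):
    """For each target feature vector, select the corpus frame whose
    features are closest (assumed Euclidean distance), then overlap-add
    the selected waveform frames. The feature formula is never inverted."""
    diff = target_feats[:, None, :] - corpus_feats[None, :, :]
    nearest = (diff ** 2).sum(axis=2).argmin(axis=1)

    frame_len = corpus_frames.shape[1]
    win = np.hanning(frame_len)
    out = np.zeros(hop * (len(nearest) - 1) + frame_len)
    norm = np.zeros_like(out)
    for i, j in enumerate(nearest):
        start = i * hop
        out[start:start + frame_len] += win * corpus_frames[j]
        norm[start:start + frame_len] += win
    # Normalize by the summed window; guard samples where it is zero.
    return out / np.maximum(norm, 1e-8), nearest
```

In this sketch the synthesis step only compares feature vectors and copies waveform frames from the corpus, so any feature set could replace `toy_features` as long as the same extractor is applied to both the corpus and the target.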
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
