Neural Fourier Shift for Binaural Speech Rendering
Jin Woo Lee, Kyogu Lee

TL;DR
This paper introduces Neural Fourier Shift, a novel neural network architecture that efficiently renders binaural speech from monaural audio by modeling delays and reflections in Fourier space, improving generalization and reducing computational costs.
Contribution
The paper proposes Neural Fourier Shift, a Fourier space-based neural network for binaural speech rendering that is more efficient, interpretable, and domain-independent than previous methods.
Findings
Performs comparably to previous methods on benchmark datasets.
Uses 25 times less memory and 6 times fewer computations than previous methods.
Operates effectively on out-of-distribution data.
Abstract
We present a neural network for rendering binaural speech from given monaural audio, position, and orientation of the source. Most previous works have focused on synthesizing binaural speech by conditioning on the positions and orientations in the feature space of convolutional neural networks. These synthesis approaches are powerful in estimating the target binaural speech even for in-the-wild data but are difficult to generalize for rendering audio from out-of-distribution domains. To alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space. Specifically, utilizing a geometric time delay based on the distance between the source and the receiver, NFS is trained to predict the delays and scales of various early reflections. NFS is efficient in both memory and computational cost, is…
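The two ingredients named in the abstract, a geometric delay derived from source-receiver distance and delayed, scaled early reflections applied in Fourier space, can be illustrated with the Fourier shift theorem: delaying a signal by τ seconds is equivalent to multiplying its spectrum by exp(-2πi·f·τ). The sketch below is not the authors' implementation. The function names, the speed-of-sound constant, and the hand-picked delays and scales are assumptions for illustration only; in NFS those delays and scales are predicted by a trained network.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s; assumed value for air at roughly 20 °C


def geometric_delay(source_pos, receiver_pos, speed=SPEED_OF_SOUND):
    """Travel time in seconds along the straight line from source to receiver."""
    diff = np.asarray(source_pos, dtype=float) - np.asarray(receiver_pos, dtype=float)
    return np.linalg.norm(diff) / speed


def fourier_shift(signal, delay_seconds, sample_rate):
    """Delay a real signal by applying a linear phase ramp to its spectrum.

    The FFT makes this a circular shift, so samples pushed past the end
    wrap around to the start; pad the input if that matters.
    """
    n = len(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)  # bin frequencies in Hz
    spectrum = np.fft.rfft(signal)
    phase = np.exp(-2j * np.pi * freqs * delay_seconds)
    return np.fft.irfft(spectrum * phase, n=n)


def render_with_reflections(signal, delays, scales, sample_rate):
    """Sum of scaled, delayed copies of the signal, computed in Fourier space.

    `delays` (seconds) and `scales` are hypothetical hand-set values here;
    they stand in for quantities a model would predict.
    """
    n = len(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sample_rate)
    spectrum = np.fft.rfft(signal)
    # Frequency response of the sum of delayed, scaled paths.
    response = np.zeros_like(freqs, dtype=complex)
    for scale, delay in zip(scales, delays):
        response += scale * np.exp(-2j * np.pi * freqs * delay)
    return np.fft.irfft(spectrum * response, n=n)


if __name__ == "__main__":
    sample_rate = 48_000
    mono = np.zeros(4096)
    mono[0] = 1.0  # unit impulse as a stand-in for speech

    # Direct-path delay for a source 1.5 m to the side of one ear (assumed geometry).
    direct = geometric_delay((1.5, 0.0, 0.0), (0.0, 0.0, 0.0))
    # Direct path plus two weaker, later reflections (illustrative values).
    ear = render_with_reflections(
        mono,
        delays=[direct, direct + 0.002, direct + 0.005],
        scales=[1.0, 0.4, 0.2],
        sample_rate=sample_rate,
    )
    print(ear.shape)
```

Because every path contributes only a phase ramp and a gain, the whole set of reflections collapses into a single multiplication per frequency bin, which is one plausible reading of why a Fourier-space formulation can be light in memory and computation.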
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
