SPEAR: Receiver-to-Receiver Acoustic Neural Warping Field
Yuhang He, Shitong Xu, Jia-Xing Zhong, Sangyun Shin, Niki Trigoni, Andrew Markham

TL;DR
SPEAR introduces a neural warping field for predicting spatial acoustic effects in 3D space, enabling efficient and physically meaningful acoustic modeling from simple recordings, with applications in robotics.
Contribution
The paper proposes a novel receiver-to-receiver acoustic warping method that does not require prior knowledge of space acoustics, improving data accessibility and physical interpretability.
Findings
SPEAR outperforms traditional models on synthetic, photo-realistic, and real-world datasets.
The warping field is proven to exist if and only if a single audio source is present.
Physical principles guide the network to learn meaningful acoustic warping.
Abstract
We present SPEAR, a continuous receiver-to-receiver acoustic neural warping field for spatial acoustic effects prediction in an acoustic 3D space with a single stationary audio source. Unlike traditional source-to-receiver modelling methods, which require prior knowledge of the space's acoustic properties to rigorously model audio propagation from source to receiver, we propose to predict by warping the spatial acoustic effects from one reference receiver position to another target receiver position, so that the warped audio essentially accommodates all spatial acoustic effects belonging to the target position. SPEAR can be trained on data that is much more readily accessible: we simply ask two robots to independently record spatial audio at different positions. We further theoretically prove the universal existence of the warping field if and only if one audio source is present. Three…
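As a rough illustration of the receiver-to-receiver idea, the sketch below treats the warping field as the per-frequency ratio between the spectra recorded at the target and reference positions, which is how one of the reviews below characterises the method. It is a toy under stated assumptions, not the authors' model: the impulse responses, the signals, and the function names (`dft`, `warping_field`, `warp`) are hypothetical, propagation is modelled as circular convolution, and the ratio is computed in closed form rather than predicted by a network.

```python
import cmath

def dft(x):
    """Naive discrete Fourier transform (fine for short toy signals)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def circular_convolve(signal, impulse_response):
    """Toy propagation model: circular convolution of source and impulse response."""
    n = len(signal)
    return [sum(signal[(t - s) % n] * impulse_response[s] for s in range(n))
            for t in range(n)]

def warping_field(ref_audio, tgt_audio):
    """Per-frequency ratio of target spectrum to reference spectrum.
    Assumes the reference spectrum has no zero bins (true for the toy data below)."""
    return [t / r for t, r in zip(dft(tgt_audio), dft(ref_audio))]

def warp(ref_audio, field):
    """Apply a warping field to a reference recording to predict the target recording."""
    return idft([w * r for w, r in zip(field, dft(ref_audio))])

# Hypothetical impulse responses from the single fixed source to each receiver.
h_ref = [1.0, 0.3, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
h_tgt = [0.0, 0.8, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0]

# Two different signals emitted from the same source position.
source_a = [1.0, 0.2, -0.1, 0.15, 0.0, 0.05, -0.1, 0.1]
source_b = [0.5, -0.4, 0.3, 0.0, 0.2, -0.1, 0.0, 0.6]

ref_a, tgt_a = circular_convolve(source_a, h_ref), circular_convolve(source_a, h_tgt)
ref_b, tgt_b = circular_convolve(source_b, h_ref), circular_convolve(source_b, h_tgt)

# Estimate the field from signal A, then use it to warp the recording of signal B.
field = warping_field(ref_a, tgt_a)
predicted = warp(ref_b, field)
print(max(abs(p - t) for p, t in zip(predicted, tgt_b)))
```

In this idealised setting the source spectrum cancels in the ratio, so a field estimated from one emitted signal also warps a different signal from the same source position; no source-to-receiver impulse response has to be known explicitly.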
Peer Reviews
Decision · Submitted to ICLR 2025
1. The receiver-to-receiver formulation of spatial acoustics is innovative, providing a new paradigm in spatial audio modeling that does not rely on prior knowledge of acoustic properties.
2. Methodologically rigorous, supported by a blend of theoretical analysis and robust experimental results across diverse datasets (synthetic, photo-realistic, real-world).
3. The overall flow and structure of the paper are clear, with detailed explanations of each stage in the model’s development, including p
1. Sampling Density Requirement: A dense sampling of receiver positions is currently required for SPEAR to achieve optimal accuracy. This requirement may limit its scalability in highly variable environments.
2. Positioning Constraint: SPEAR assumes all receiver positions lie on the same horizontal plane, which could restrict applications in multi-level or irregular environments. Addressing this limitation would extend the model’s utility.
* The method can be applied relatively easily since it does not require a lot of knowledge about the environment
* It is relatively original since most other methods require more information about the environment or a more complex recording setup
* Quality is somewhat unclear (see comments below)
* Clarity and significance could be improved (see comments below)
* The method estimates the ratio between two transfer functions, which is prone to ill-conditioning; however, this is not fully addressed. In the experiments they simply use clipping and zeroing to handle such cases, but how this affects performance is not clear
* The strengths and weaknesses of the method are not sufficiently clear
* I’m having some difficulty understanding exactly what SPEAR is trying to learn. In line 304: “...we can obtain the ground truth warping field…” - if you can
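The ill-conditioning concern above can be made concrete: when the reference spectrum is close to zero at some frequency, the ratio of the two spectra blows up. The sketch below shows one plausible form of "clipping and zeroing". The function name `stabilised_ratio`, the thresholds `max_mag` and `eps`, and the exact rule are assumptions made for illustration, since the review does not say what the paper uses.

```python
def stabilised_ratio(tgt_bin, ref_bin, max_mag=10.0, eps=1e-8):
    """Ratio of one target frequency bin to one reference bin, guarded
    against ill-conditioning. Thresholds are hypothetical, not from the paper."""
    # Zeroing: the reference bin carries essentially no energy,
    # so the ratio is undefined and is set to zero.
    if abs(ref_bin) < eps:
        return 0j
    ratio = tgt_bin / ref_bin
    magnitude = abs(ratio)
    # Clipping: cap the magnitude while keeping the phase unchanged.
    if magnitude > max_mag:
        ratio = ratio * (max_mag / magnitude)
    return ratio

print(stabilised_ratio(2 + 0j, 1 + 0j))      # well-conditioned bin, left as is
print(stabilised_ratio(1 + 0j, 0.001 + 0j))  # large ratio, magnitude is capped
print(stabilised_ratio(1 + 0j, 0j))          # empty reference bin, zeroed
```

Both guards discard information at the affected frequencies, which is why the reviewer asks how they affect performance.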
The paper presents a novel viewpoint that learns the warping field that connects two receivers using a new transformer-based model. A comprehensive experimental study is conducted with both simulated and real-world data, and different aspects of the proposed method are examined. The paper is clearly written and contains meaningful illustrations that clarify its core ideas.
The usefulness of the proposed method is not well motivated. On the one hand, the warping field corresponds to a fixed source position, but in reality it is more informative to consider a source that can change locations. On the other hand, the space of all possible warping fields seems to be unnecessarily large, since it is defined by two receiver locations, whereas for RIR estimation the mapping is a function of a single receiver only (and a source position). This is also evident from the vast
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Dropout · Byte Pair Encoding · Adam · Dense Connections · Softmax
