Deep Learning for Personalized Binaural Audio Reproduction
Xikun Lu, Yunda Chen, Zehua Chen, Jie Wang, Mingxing Liu, Hongmei Hu, Chengshi Zheng, Stefan Bleeck, Jinqiu Sang

TL;DR
This survey reviews recent deep learning techniques for personalized binaural audio, focusing on explicit filtering and end-to-end rendering methods, datasets, evaluation metrics, and future research directions.
Contribution
It categorizes deep learning approaches into two paradigms, summarizes datasets and metrics, and discusses applications, limitations, and future research in personalized spatial audio.
Findings
Explicit methods predict personalized HRTFs from measurements or cues.
End-to-end methods directly map source to binaural signals with personalization.
The survey highlights key datasets and evaluation metrics for fair comparison.
Abstract
Personalized binaural audio reproduction is the basis of realistic spatial localization, sound externalization, and immersive listening, directly shaping user experience and listening effort. This survey reviews recent advances in deep learning for this task and organizes them by generation mechanism into two paradigms: explicit personalized filtering and end-to-end rendering. Explicit methods predict personalized head-related transfer functions (HRTFs) from sparse measurements, morphological features, or environmental cues, and then use them in the conventional rendering pipeline. End-to-end methods map source signals directly to binaural signals, aided by other inputs such as visual, textual, or parametric guidance, and they learn personalization within the model. We also summarize the field's main datasets and evaluation metrics to support fair and repeatable comparison. Finally, we…
| Method | Year | Morphological Data | Datasets | model | Data Representation |
Physiological
Parameters |
||
| Ant. | Ima. | 3D sca. | ||||||
| Hu et al. [50] | 2008 | ✓ | CIPIC | BP ANN | HRTFs | 8 | ||
| Chen et al. [48] | 2019 | ✓ | CIPIC | AEs+DNNs | AEs | 27 | ||
| Wang et al. [51] | 2020 | ✓ | HUTUBS | 1D CNN | SH Cofficients | 25 | ||
| Zhang et al. [52] | 2020 | ✓ | CIPIC | DNN | SPCA Weights | 8 | ||
| Miccini et al. [53] | 2020 | ✓ | HUTUBS | VAE+DNN | VAE & CVAE | 27 | ||
| Yao et al. [54] | 2022 | ✓ | HUTUBS | AE+DNN | VAE | 12 | ||
| Zhang et al. [55] | 2023 | ✓ | CIPIC | FCNN+Attention | HRTFs & ITDs | 8 | ||
| Sánchez et al. [56] | 2025 | ✓ | HUTUBS | DDPM | HRIRs | 27 | ||
| Lee & kim [29] | 2018 | ✓ | ✓ | CIPIC | CNN-DNN | HRTFs | 17 | |
| Miccini & Spagnol [57] | 2021 | ✓ | HUTUBS | VAE+DNN | CVAE | 3 | ||
| Zhao et al. [58] | 2022 | ✓ | ✓ | CIPIC | VGG-Ear+FC | SH Cofficients | 17 | |
| PRTFNet [59] | 2023 | ✓ | HUTUBS | CNN | PRTF | — | ||
| Javeri et al. [60] | 2023 | ✓ | HUTUBS | 3DR & ASNN | HRTFs | — | ||
| Fantini et al. [61] | 2021 | ✓ | HUTUBS | GRNN | DTF & CTF | 11 | ||
| Zhou et al. [62] | 2021 | ✓ | — | CNN-Reg/UNet-Reg | HRTFs | — | ||
| Wang et al. [63] | 2022 | ✓ | HUTUBS | CNN-FC | SH Cofficients | — | ||
| Zhao et al. [64] | 2024 | ✓ | 3D3A | CNN-FC | HRTFs | — | ||
| Abbreviations: Ant(hropometry), Ima(ge), 3D sca(n). | ||||||||
| Method | Year | Datasets | Model | Data Representation | Sparse Locations |
|---|---|---|---|---|---|
| Discrete-Domain | |||||
| Ito et al. [71] | 2022 | HUTUBS | AE + Aggregation | AEs | |
| Zandi et al. [72] | 2022 | ITA | CVAE | AEs | 60 |
| Zurale & Dubnov [73] | 2023 | BiLi | VQ-VAE + Transformer | AEs | 25 |
| Chang et al. [74] | 2025 | CIPIC | VAE+DNN | AEs | — |
| Zurale et al. [75] | 2022 | CIPIC | DCNN | HRTFs | 72,18 |
| Jiang et al. [76] | 2023 | CIPIC | U-Net | HRTFs | 3,4,6,8,12,23,105 |
| Chen et al. [77] | 2023 | HUTUBS | Spherical CNN | SH Cofficients | 120 |
| Thuillier et al. [78] | 2024 | HUTUBS | SConvCNP | SH Cofficients | |
| Zhao et al. [79] | 2025 | HUTUBS | CNN | HRTFs | 6,14,26,38,50,74,86,110,146,170 |
| Hogg et al. [80] | 2024 | ARI | SR-GAN | HRTFs | 5,20,80,320 |
| Hu et al. [81] | 2024 | SONICOM | AE-GAN | SH Cofficients | 8,12,18,27 |
| HRTF-DUNet [82] | 2025 | SONICOM | Denoisy U-Net + AE-GAN | SH Cofficients | 3,4,8,18,27 |
| Continuous-Domain | |||||
| Lee et al. [83] | 2023 | HUTUBS | FiLM + HyperConv | HRTFs | 4,8,12,16 |
| HRTF Field [84] | 2023 | HRTF Datasetsa | SIREN / IGON | HRTFs | 5%,10%,15%,20%,25% |
| Ma et al. [85] | 2023 | HUTUBS | PINN | HRTFs | 315,675 |
| NIIRF [86] | 2024 | CIPIC / HUTUBS | INR | HRTFs | 10,20,30,50,100 |
| Neural Steerer [87] | 2024 | EasyCom | SIREN | HRIRs | 15%,30%,45%,60%,75%,90% |
| RANF [88] | 2025 | SONICOM | INR | HRTFs & ITDs | 3,5,19,100 |
| BiCG [89] | 2025 | HRTF Datasetsa | IGON | ILDs & ITDs | — |
| aHRTF Datasets: RIEC, 3D3A, Aachen, ARI, BiLi, CIPIC, Crossmod, HUTUBS, Listen, and SADIE. | |||||
| Name | Year | Subjects | Positions | Distance (m) | Spatial Resolution | Sampling Scheme | Morphology |
| CIPIC [31] | 2001 | 45 | 1250 | 1.00 | Interaural-polar | Anthropometry | |
| Listen [97] | 2003 | 51 | 187 | 1.95 | Geodesic grid | Anthropometry | |
| RIEC [98] | 2014 | 105 | 865 | 1.50 | Geodesic grid | 3D meshes | |
| BiLi [99] | 2014 | 57 | 1680 | 2.06 | Geodesic grid | No | |
| ARI [100] | 2016 | 250 | 1550 | 1.20 | Geodesic grid | Anthropometry | |
| ITA [101] | 2016 | 48 | 2304 | 1.20 | Geodesic grid | Anthropometry; 3D meshes | |
| Aachen [101] | 2016 | 48 | 2304 | 1.20 | Geodesic grid | Anthropometry; 3D meshes | |
| 3D3A [102] | 2017 | 38 | 648 | 0.76 | Geodesic grid | Anthropometry; 3D meshes | |
| SADIE [103] | 2018 | 20 | 1.20 | Geodesic grid | Images; 3D meshes | ||
| OlHeaD-HRTF [104] | 2018 | 16 | 91 | 2.50-3.00 | Interaural-polar | No | |
| HUTUBS [33] | 2019 | 96 | 440 | 1.47 | Near-Lebedev | Anthropometry; 3D meshes | |
| CHEDAR [105] | 2020 | 1253 | 0.2,0.5,1.2 | Geodesic grid | Anthropometry; 3D meshes | ||
| SONICOM [106] | 2023 | 200 | 793 | 1.50 | Geodesic grid | Images; 3D meshes | |
| In this table, for geodesic and Near-Lebedev sampling schemes, and generally represent the resolution in elevation and azimuth, respectively. For the interaural-polar scheme, typically refers to the lateral angle resolution, and to the polar angle resolution. | |||||||
| Metric | Formula | Focus | Task |
| LSD [110] | Spectral Difference | Measure average log-magnitude error. | |
| SDE [62] | Spectral Difference | Measures mean absolute log-magnitude error. | |
| LRE [68, 78] | Spectral Difference | Measures relative error on log-magnitude spectra. | |
| LMD [68, 78] | Spectral Difference | Measures mean absolute log-magnitude difference. | |
| RMSE [111, 55] | Overall Difference | Measures average HRIR difference. | |
| MAE [112] | Overall Difference | Measures average HRIR difference. | |
| SDR [113, 114] | Signal-to-Error Ratio | Ratio of HRIR/HRTF signal energy to error energy. | |
| PCC [74] | Statistical Correlation | Measures linear correlation of magnitude spectra. | |
| Abbreviations: LSD: log-spectral distortion, SDE: spectral distance error, LRE: logarithmic relative error, LMD: logarithmic magnitude distance, RMSE: root mean square error, MAE: mean absolute error, SDR: signal-to-distortion ratio, PCC: Pearson correlation coefficient. The symbols used are: for the true complex HRTF at direction and frequency ; for the predicted complex HRTF; for the true HRIR at time ; for the predicted HRIR. is the number of frequency bins, the number of directions, and the number of time samples. Metrics are typically averaged over directions and/or frequencies as appropriate. Arrows ( / ) indicate desirable direction. | |||
| Method | Year | Model | Code |
| Gebru et al. [37] | 2021 | TCN | No |
| WarpNet [38] | 2021 | Warping + Temporal ConvNet | Github |
| BinauralGrad [163] | 2022 | Diffusion Model | Github |
| Huang et al. [162] | 2022 | VQ-VAE | No |
| DopplerBAS [165] | 2023 | WarpNet / BinauralGrada | No |
| NFS [168] | 2023 | — | Github |
| DIFFBAS [164] | 2024 | WarpNet / NFSb | Github |
| ZeroBAS [169] | 2024 | GTW + AS + Denoising Vocoder | No |
| Zhang et al. [167] | 2025 | TW + POSA + GCFM | No |
| DPATFNet [166] | 2025 | TDW + DPAB + MPF | No |
| aDopplerBAS considers velocity information based on WarpNet and BinauralGrad to simulate the Doppler effect. bDIFFBAS redesigns the loss function based on the models studied in WarpNet and NFS. | |||
| Method | Year | Input Modalities | Model Architectures | Feature Fusion | Training Strategy | Code | ||||
|---|---|---|---|---|---|---|---|---|---|---|
| Aud. | Vis. | Tex. | Par. | Backbone | Encoder | |||||
| Visual-guided Synthesis | ||||||||||
| Morgado et al. [147] | 2018 | ✓ | ✓ | CNN | ResNet18 | Connect | Supervised | Github | ||
| Mono2binaural [148] | 2019 | ✓ | ✓ | U-Net | ResNet18 | Connect | Self-Supervised | Github | ||
| ASN [170] | 2019 | ✓ | ✓ | U-Net | ResNet18 | Connect | Self-Supervised | No | ||
| Sep-Stereo/APNet [149] | 2020 | ✓ | ✓ | U-Net | ResNet18 | APNet | Multi-Task Learning | Github | ||
| PseudoBinaural [171] | 2021 | ✓ | ✓ | U-Net | ResNet18 | Connect | Multi-Task Learning | Github | ||
| Li et al. [172] | 2021 | ✓ | ✓ | U-Net | ResNet18 | Attention | Multi-Task Learning | No | ||
| L2BNet [173] | 2021 | ✓ | ✓ | U-Net | ResNet18 | Attention | Semi-Supervised | No | ||
| Lin et al. [174] | 2021 | ✓ | ✓ | U-Net | ResNet18 | Attention | Semi-Supervised | No | ||
| MAFNet [175] | 2021 | ✓ | ✓ | U-Net | ResNet18 | Attention | Self-Supervised | No | ||
| Bmonobinaural [176] | 2022 | ✓ | ✓ | U-Net | ViT-Large | Attention | Supervised | No | ||
| Points2Sound [177] | 2022 | ✓ | ✓ | Demucs | ResNet18 | Conditioning | Supervised | Github | ||
| Garg et al. [178] | 2023 | ✓ | ✓ | U-Net | ResNet18 | Connect | Multi-Task Learning | Github | ||
| CLUP [150] | 2024 | ✓ | ✓ | Diffusion Model | ResNet18 | — | Cyclic Learning | No | ||
| Liu et al. [179] | 2024 | ✓ | ✓ | U-Net | ResNet18 | Weighting | Contrastive Learning | No | ||
| SAGM [180] | 2024 | ✓ | ✓ | GAN | C3D | Connect | Supervised | No | ||
| CCStereo [181] | 2025 | ✓ | ✓ | U-Net | ResNet | AVAD | Contrastive Learning | No | ||
| OmniAudio [182] | 2025 | ✓ | ✓ | DiT | MetaCLIP-Huge | — | Self-Supervised | Github | ||
| AV-NeRF [151] | 2023 | ✓ | ✓ | A-NeRF | V-NeRF | AV-Mapper | Supervised | Github | ||
| NeRAF [183] | 2025 | ✓ | ✓ | NAcF | NeRF | Connect | Supervised | Github | ||
| AV-GS [152] | 2024 | ✓ | ✓ | Acoustic Field | 3D-GS | Connect | Supervised | No | ||
| AV-Cloud [184] | 2024 | ✓ | ✓ | ✓ | AVCS | AV Anchors | Attention | Supervised | Github | |
| SOAF [153] | 2024 | ✓ | ✓ | ✓ | NAF | SDFStudio | Attention | Supervised | No | |
| AV-Surf [154] | 2025 | ✓ | ✓ | Transformer | ResNet/PointNet | Attention | Supervised | No | ||
| SoundVista [185] | 2025 | ✓ | ✓ | ✓ | Transformer | ResNet18 | Attention | Supervised | No | |
| Text-guided Synthesis | ||||||||||
| TAS [155] | 2024 | ✓ | ✓ | Diffusion Model | CLIP | SAF | Supervised | No | ||
| DualSpec [156] | 2025 | ✓ | ✓ | Diffusion Model | FLAN-T5 | — | Semi-Supervised | No | ||
| AudioSpa [157] | 2025 | ✓ | ✓ | Residual block | FLAN-T5 | Attention | Supervised | No | ||
| SpatialTAS [186] | 2025 | ✓ | ✓ | Diffusion Model | FLAN-T5 | Attention | Supervised | No | ||
| Joint Multi-modal Guided Synthesis | ||||||||||
| SEE-2-SOUND [159] | 2024 | ✓ | ✓ | CoDi | ViT-H/L | — | Zero-Shot | Github | ||
| SpatialSonic [160] | 2025 | ✓ | ✓ | ✓ | HTSAT | Mask-RCNN/T5 | Attention | Supervised | Github | |
| ImmerseDiffusion [158] | 2025 | ✓ | ✓ | ✓ | DiT | ELSA / CLAP | — | Supervised | No | |
| Diff-SAGe [187] | 2025 | ✓ | SiT | — | — | Supervised | No | |||
| ViSAGe [161] | 2025 | ✓ | ✓ | Transformer | CLIP | Attention | Supervised | Github | ||
| ISDrama [188] | 2025 | ✓ | ✓ | ✓ | ✓ | Mamba-Transformer | FLAN-T5 / CLIP | Attention | Supervised | No |
| Abbreviations: Aud(io), Vis(ual), Tex(t), Par(ametric). Parameter information comprises spatial and environmental parameters. Spatial parameters define source-listener relationships through location, orientation, and distance. Environmental parameters include room dimensions and reverberation characteristics. | ||||||||||
| Dataset | Year | Type | Scene | Modality | Scale | Link | ||||
| A | V | T | M | Hours | Samples | |||||
| Audio-driven Synthesis Datasets | ||||||||||
| Binaural Speech [38] | 2021 | Real | Regular room, Treated | ✓ | ✓ | h | 8 speakers | Link | ||
| Multi-Model Guided Synthesis Datasets | ||||||||||
| REC-Street [147] | 2018 | Real | Outdoor street | ✓ | ✓ | 3.5h | 43 clips | Link | ||
| YT-Alla [147] | 2018 | Real | Real-world | ✓ | ✓ | 113.1h | 1146 clips | Link | ||
| FAIR-Play [148] | 2019 | Real | Music room | ✓ | ✓ | 5.2h | 1871 clips | Link | ||
| MUSIC-Stereo [171] | 2021 | Real | Music performance | ✓ | ✓ | 49.7h | 1,120 clips | Link | ||
| SimBinaural [178] | 2023 | Sim. | — | ✓ | ✓ | 116.1h | 21k clips | Link | ||
| YouTube-Binaural [178] | 2023 | Real | Real-world | ✓ | ✓ | 27.7h | 426 clips | Link | ||
| BEWO-1M [160] | 2025 | Sim., Real | — | ✓ | ✓ | 2.8k h | 1M samples | Link | ||
| SoundSpaces [195, 196] | 2022 | Sim. | Arbitrary 3D mesh Env. | ✓ | ✓ | — | 1600+ scenes | Link | ||
| GWA [197] | 2022 | Sim. | Diverse pro-designed houses | ✓ | ✓ | ✓ | — | k scenes | Link | |
| RWAVS [151] | 2023 | Real | Real-world, Multi-env | ✓ | ✓ | ✓ | 3.8h | 12k samples | Link | |
| Replay [198] | 2023 | Real | Indoor social | ✓ | ✓ | ✓ | k min | M samples | Link | |
| SoundCam [199] | 2023 | Real | Lab, Living, Meeting | ✓ | ✓ | ✓ | — | 2k samples | Link | |
| RAF [200] | 2024 | Real | Real acoustic rooms | ✓ | ✓ | ✓ | — | 47K/39k RIRsb | Link | |
| RealMAN [201] | 2024 | Real | Indoor, Outdoor, Semi, Traffic | ✓ | ✓ | ✓ | 83.7/144.5hc | — | Link | |
| SonicSim [202] | 2025 | Sim. | 3D scenes | ✓ | ✓ | ✓ | 360h | 90 scenes | Link | |
| Abbreviations: V: video frame/visual, A: audio/ambisonics, M: masks/metadata/motion, T: text, Sim.: simulated, Real: real-world measurement/recording. aThe YT-All dataset contains the sub-datasets YT-Music(397 clips) and YT-Clean(496 clips). b47K RIRs for empty rooms, 39K RIRs for furnished rooms. c83.7h voice, 144.5h noise. | ||||||||||
| Metric | Formula | Focus | Task |
| WAV [38] | Similarity | Measures waveform similarity. | |
| STFT Distance [148] | Similarity | Measures time-frequency similarity. | |
| Mag (MAG) [171] | Similarity | Measures magnitude spectrum similarity. | |
| Phs [171] | Similarity | Measures phase spectrum similarity. | |
| ENV [147] | Similarity | Measures temporal envelope similarity. | |
| LRE [205] | Spatial accuracy | Measures left-right energy balance error. | |
| ITD/ILD Error [206] | Spatial accuracy | Measures interaural cue (ITD/ILD) accuracy. | |
| FAD [207] | — | Perceptual quality | Assesses perceptual realism via embeddings. |
| Abbreviations: WAV: waveform L2, STFT Distance: short-time fourier transform distance, Mag (MAG): magnitude distance, Phs: phase distance, ENV: envelope distance, LRE: left-right energy ratio, ITD/ILD Error: interaural time/level difference error, FAD: Fréchet audio distance. The symbols used are: is the ground-truth discrete-time binaural signal, is the synthesized signal. and are their respective STFTs (time frame , frequency ). is energy. and (in dB) are functions for interaural cues. Arrows ( / ) indicate desirable direction. | |||
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsHearing Loss and Rehabilitation · Music and Audio Processing · Speech and Audio Processing
Deep Learning for Personalized Binaural Audio Reproduction
Xikun Lu
Yunda Chen
Zehua Chen
Jie Wang
Mingxing Liu,
Hongmei Hu
Chengshi Zheng
Stefan Bleeck
Jinqiu Sang Corresponding author: Jinqiu Sang. E-mail: [email protected]; Chengshi Zheng. E-mail: [email protected]; Hongmei Hu. E-mail: [email protected] Lu is with the Lab of Artificial Intelligence for Education, East China Normal University, Shanghai 200050, China.Yunda Chen is with the Guangdong Key Laboratory of Intelligent Information Processing, College of Electronics and Information Engineering, Shenzhen University, Shenzhen 518060, China.Zehua Chen is with the Department of Computer Science and Technology, Tsinghua University, Beijing 100190, China.Jie Wang is with the School of Electronics and Communication Engineering, Guangzhou University, Guangzhou 511400, China.Mingxing Liu and Jinqiu Sang are with the School of Computer Science and Technology, East China Normal University, Shanghai 200050, China.Hongmei Hu is with the Medizinische Physik, Carl von Ossietzky University of Oldenburg, Oldenburg 26129, Germany.Chengshi Zheng is with the Institute of Acoustics, Chinese Academy of Sciences, Beijing 100045, China.Stefan Bleeck is with the Institute of Sound and Vibration Research, University of Southampton, Southampton SO16 3HH, United Kingdom.
Abstract
Personalized binaural audio reproduction is the basis of realistic spatial localization, sound externalization, and immersive listening, directly shaping user experience and listening effort. This survey reviews recent advances in deep learning for this task and organizes them by generation mechanism into two paradigms: explicit personalized filtering and end-to-end rendering. Explicit methods predict personalized head-related transfer functions (HRTFs) from sparse measurements, morphological features, or environmental cues, and then use them in the conventional rendering pipeline. End-to-end methods map source signals directly to binaural signals, aided by other inputs such as visual, textual, or parametric guidance, and they learn personalization within the model. We also summarize the field’s main datasets and evaluation metrics to support fair and repeatable comparison. Finally, we conclude with a discussion of key applications enabled by these technologies, current technical limitations, and potential research directions for deep learning-based spatial audio systems.
Index Terms:
binaural audio reproduction, head-related transfer function, binaural audio synthesis, personalized modeling, multimodality.
I Introduction
The human auditory system possesses a remarkable ability to perceive the location of sounds within an acoustic environment. This immersive experience is known as spatial sound perception [1, 2]. This capability is vital for communication, navigation, environmental awareness, and creating engaging auditory environments [3, 4]. Consequently, spatial audio reproduction technology, which aims to reconstruct or simulate realistic three-dimensional (3D) sound fields, has attracted significant research interest. This technology has widespread applications in the domain of room simulation [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15], extended reality (XR), encompassing virtual reality (VR) [16], augmented reality (AR) [17], and mixed reality (MR) [18].
Spatial audio reproduction techniques generally fall into three categories based on their implementation approaches and goals [4]. The first category aims to physically reconstruct sound fields with high accuracy, including wave field synthesis (WFS) [19] and higher-order ambisonics (HOA) [20]. These methods can theoretically reproduce sound fields accurately over large areas but typically require many loudspeakers and complex processing, leading to high costs. The second category relies on psychoacoustic principles to approximate physical sound fields, such as traditional stereo and surround sound systems [21]. These systems use fewer loudspeakers to create spatial sensations but provide precise 3D localization only within limited listening areas. The third category, which is the primary focus of this survey, is binaural reproduction. This approach, typically experienced through headphones, aims to simulate the sound pressure signals at a listener’s eardrums. Binaural audio offers highly accurate spatial localization and immersion through headphones [22], making it well-suited for VR, AR, gaming, and mobile applications.
Binaural audio reproduction is achieved via two main technical approaches: (1) techniques based on head-related transfer function (HRTF) filtering, and (2) methods using end-to-end binaural synthesis. Both approaches have rapidly been transformed through modern data-driven methods. Figure 1 illustrates these two paradigms, which form the core structure of the technical section of this survey.
I-A HRTF-based Binaural Reproduction
High-quality binaural reproduction through this approach (Figure 1(a)) depends on accurately modeling the listener-specific acoustic filtering of the head, torso, and the pinnae. This filtering process adds crucial spatial cues to sound signals reaching the eardrums. These cues include interaural time differences (ITDs), interaural level differences (ILDs), and monaural spectral cues [2, 23]. This acoustic transfer function is known as the HRTF [24, 25]. Synthesizing binaural audio typically involves convolving a source audio signal with the appropriate head-related impulse response (HRIR), which is the time-domain representation of the HRTF, for each ear.
A major challenge is that HRTFs vary significantly between individuals [26]. Using non-personalized HRTFs degrades the quality of the spatial quality experience, causing localization errors and reducing immersion [27, 28, 29]. Traditional methods for obtaining personalized HRTFs include acoustic measurement [30, 31], numerical simulation [32, 33], and spatial interpolation [34, 35, 36]. Acoustic measurements provide accurate results but are expensive, time-consuming, and require specialized anechoic environments [30]. Numerical simulations avoid physical measurements but require extensive computing resources and depend heavily on the accuracy of 3D anatomical models [32, 33]. Furthermore, traditional spatial interpolation methods, such as nearest-neighbor approaches [34, 35] or techniques based on functional models [36], can operate with sparse HRTF measurements but often lack the accuracy or efficiency required for real-time personalization. These limitations hinder the widespread adoption of personalized binaural audio.
I-B End-to-End Binaural Synthesis
The second approach (Figure 1(b)) focuses on end-to-end binaural audio synthesis. These models are trained to directly convert various inputs into binaural audio output. These inputs can include mono or multi-channel audio, visual scene information, text descriptions, or other control parameters. This approach can bypass the explicit HRIR convolution step [37], handle complex acoustic scenes, incorporate room acoustics (reverberation), and combine multi-modal information for richer spatial experiences. Although conceptually promising, traditional signal processing methods for synthesis from complex inputs struggle to generalize and capture realistic spatial sound characteristics [38]. In recent years, deep learning (DL) has transformed spatial audio reproduction by offering powerful tools to overcome limitations in both HRTF-based modeling and end-to-end synthesis. DL’s ability to learn complex, non-linear patterns from large datasets has driven significant innovation across the field, forming the central theme of this survey.
I-C Contribution and Scope
The impact of machine learning (ML) on spatial audio has received considerable attention, leading to several relevant surveys. These reviews include surveys focusing on ML applications for HRTF personalization [39, 40, 41], as well as broader overviews encompassing data-driven approaches for spatial audio capture, processing, and reproduction alongside traditional methods [42].
While existing surveys effectively cover significant advancements in applying ML to HRTF modeling, a comprehensive survey dedicated to the full range of DL methods for binaural audio synthesis is still lacking. Specifically, there remains a gap in the literature that systematically examines DL-driven advancements across both main approaches: enhancing HRTF-based techniques and developing end-to-end binaural audio generation from diverse inputs.
This survey aims to fill this gap. Our work provides a structured overview of how DL is reshaping these two fundamental pathways for creating immersive binaural experiences. Notably, this survey offers one of the first comprehensive overview of DL-based end-to-end spatial audio synthesis, alongside a thorough examination of DL innovations in HRTF modeling. We organized and analyze recent advancements in:
- •
HRTF personalized modeling (Section II): Including techniques for HRTF personalization using morphological features, environmental cues, and efficient spatial interpolation.
- •
End-to-end binaural synthesis (Section III): Covering methods driven by various inputs, including single-modal audio or multi-modal approaches incorporating visual, textual, or parametric guidance.
We also discuss relevant datasets and evaluation metrics within their respective sections. The survey then highlights key applications enabled by these DL-driven advancements (Section IV). By organizing our analysis around these core DL-driven pathways for binaural audio, this survey provides researchers with a comprehensive understanding of state-of-the-art strategies, identifies ongoing challenges, and suggests promising directions for future research (Section V).
II HRTF Personalized Modeling
Acquiring accurate and personalized HRTFs is fundamental to high-quality binaural reproduction. By leveraging large-scale data, DL methods effectively address the cost, time, individualization, and spatial resolution limitations of traditional HRTF modeling. This section provides a comprehensive overview of DL applications in HRTF personalization, covering data representation, personalization strategies, dataset fusion, and evaluation methodologies. Figure 2 illustrates the main areas of DL’s contributions.
II-A HRTF Data Representation
Effectively handling the high dimensionality and complex structure of HRTF data is a primary challenge for DL models, making the selection of an appropriate data representation crucial for model performance. Common representation methods include time-domain, frequency-domain, and sparse representations.
II-A1 Time-domain Representation
This approach uses the HRIR sequence directly as input or output for DL models. HRIRs capture the complete temporal evolution of sound waves arriving at the eardrums, including initial onset delays and subsequent reflections from the listener’s anatomy [2, 25].
II-A2 Frequency-domain Representation
The frequency-domain representation decomposes the HRTF into logarithmic magnitude and phase components. Log-magnitude spectra exhibit smoother patterns across frequencies and spatial directions, which better approximate the human ear’s compressive response to sound intensity, potentially facilitating model learning [43]. Phase spectrum modeling is more challenging due to wrapping ambiguities that often require additional processing steps, which can introduce errors. Using complex-valued frequency representations preserves both magnitude and phase information without unwrapping.
Beyond the full HRTF, researchers use specialized frequency-domain representations to isolate specific acoustic phenomena. The directional transfer function (DTF) captures only the direction-dependent HRTF components by removing the direction-independent common transfer function (CTF), which represents average spectral features [26, 24]. This separation simplifies directional cue modeling. Similarly, the pinna-related transfer function (PRTF) isolates acoustic filtering effects specific to the listener’s pinna [44, 45, 46]. These specialized representations help focus modeling efforts on perceptually relevant spatial cues.
II-A3 Sparse Representation
To address the challenge of HRTF dimensionality, sparse representations capture the essential information using fewer parameters. Physics-based approaches employ spherical harmonics (SH) to project the spatial HRTF field onto basis functions, yielding coefficients that inherently represent spatial continuity [47]. The accuracy depends on the SH order, with higher orders required for capturing fine spatial details. Data-driven dimensionality reduction offers an alternative approach. For instance, Autoencoders (AE) learn non-linear mappings to a low-dimensional latent space [48, 49]. Although they achieve significant compression, these methods may not preserve all perceptually relevant acoustic details.
II-B HRTF Personalization from Morphological Features
By using readily accessible data such as anthropometric measurements, photographs, or 3D scans of a listener’s head and pinnae, DL models can predict personalized HRTFs that match an individual’s unique acoustic characteristics. This personalization relies on the strong physical relationship between an individual’s morphology and their HRTF. DL models excel at learning the mapping from these geometric features to the corresponding acoustic transfer functions. This section focuses on methods that directly predict HRTFs from such morphological features. Table I summarizes key studies in this area.
II-B1 Predict from Sparse Anthropometry
A significant research direction focuses on predicting HRTFs from a limited set of easily measurable anthropometric parameters. Initial studies demonstrated that this approach is feasible using artificial neural networks (ANNs) [50]. To enhance performance with limited input data, researchers have employed representation learning techniques, such as AEs and their variants. These models reduce dimensionality by learning compact latent representations of HRTFs that can then be predicted from anthropometric data [48, 53, 54]. Integrating signal processing knowledge through the spherical harmonic transform (SHT) [51] or spatial principal component analysis (SPCA) [52] as intermediate steps has helped to incorporate structural information into the learning process.
Recently, more advanced architectures and powerful generative models have further improved HRTF personalization. VAEs have been further refined for HRTF prediction from morphological data [54]. A major advance is the emergence of diffusion models, which show promise for generating complete HRIR distributions from anthropometric inputs, potentially capturing finer details of HRTFs [56]. Alongside these generative approaches, neural network architectural elements like attention mechanisms have enhanced model effectiveness [55]. Attention mechanisms allow models to dynamically focus on the most important anthropometric features during prediction, leading to more accurate HRTF estimations.
II-B2 Predict from Image
2D images, especially photographs of the pinna, provide richer geometric detail than sparse numerical measurements. Early studies in this area often combined image-derived features with traditional anthropometric parameters, demonstrating that the inclusion of visual data reduced spectral distortion compared to using anthropometry alone [29, 58]. Subsequent research has moved toward methods driven purely by image inputs. Conditional Variational Autoencoders (CVAE) have been used to predict HRTF from features extracted from pinna images [57]. Specialized architectures like PRTFNet [59] have been designed to predict PRTFs, which are crucial for vertical sound localization, from pinna images. Further advancements aim to predict both personalized HRTFs and corresponding headphone equalization filters from 2D images or short videos of the pinna [60], thereby improving the quality and accessibility of personalized spatial audio.
II-B3 Prediction from 3D Geometric Models
3D models from body scans provide the most complete information about an individual’s head and pinna morphology. Early research in this area focused on automatically extracting a predefined set of anthropometric features from these 3D models for HRTF prediction [61]. However, such methods are limited by the manual selection of these features. A more direct approach involves predicting HRTFs directly from the raw 3D scan data. CNNs and U-Net architectures have been employed to process 3D pinna shapes, represented as point clouds or voxel grids. These models capture fine geometric details, leading to improved prediction accuracy, particularly for the medium and high frequency components of the HRTF [62]. Advances in geometric DL have enabled more efficient methods. For instance, some methods project 3D scan data onto mathematical basis functions before inputting them to the network [63]. Specialized network designs have also been developed, such as cascading CNNs or models incorporating symmetries like vertical plane feature sharing, to reduce computational needs while maintaining high accuracy [63, 64]. Recognizing the difficulty in obtaining high-quality 3D scans, recent research has also focused on improving data acquisition and preprocessing. Advanced DL frameworks aim to reconstruct 3D pinna models from more easily obtainable 2D images [65, 60], while denoising models have been developed to improve the quality of scanned data, which can be noisy or incomplete [66].
II-C HRTF Personalization Using Environment Cues
The structural features of the human pinna provide a fundamental foundation for HRTF personalization. However, relying exclusively on physical measurements often proves insufficient for achieving optimal spatial audio quality and perceptual accuracy in real-world listening scenarios. These limitations, coupled with the practical difficulties in acquiring precise physical measurements, have led researchers to investigate alternative approaches that extract HRTF characteristics from acoustic cues embedded in the listener’s interaction with their environment.
This emerging research direction estimates personalized HRTFs by analyzing how everyday sounds are modified by the listener’s presence and movements. The core concept is that acoustic interactions between listeners and their surroundings implicitly encode information about their HRTFs. Jayaram et al. [67] designed a U-Net architecture that learns to infer the listener’s HRTF by observing changes in binaural audio recordings captured as listeners make natural head movements in response to ambient sounds. Similarly, Thuillier et al. [68] used diffusion models to jointly estimate room impulse responses (RIRs) and HRTFs from binaural recordings, showing promise in preserving high-frequency individual differences in the estimated HRTFs. The growing availability of mobile devices with microphones makes such “in-the-wild” personalization increasingly attractive. This approach could make personalized spatial audio more accessible without requiring laboratory measurements or specialized equipment.
II-D HRTF Spatial Interpolation
Beyond personalization based on individual features, DL addresses key challenges in HRTF modeling related to spatial resolution limitations and representation efficiency. Traditional interpolation methods struggle with accurately capturing the non-linear spatial-spectral characteristics of HRTFs, especially when upsampling from sparse measurements [69, 70]. This limitation stems from their underlying linear assumptions and reliance on manually designed features. DL-driven spatial interpolation research has advanced significantly along two main paths: discrete domain interpolation and continuous domain representation. Table II summarizes key studies in this area.
II-D1 Discrete Spatial Interpolation
Discrete domain methods work with HRTFs sampled at specific spatial locations. They learn the underlying spatial relationships to reconstruct denser HRTF sets from sparse measurements. Common DL architectures for this task include AEs, CNNs, and generative adversarial networks (GANs).
Autoencoders (AEs). AEs excel at interpolation tasks through their ability to learn compressed data representations (latent codes). For HRTFs, such architectures learn sparse representations that enable interpolation at unmeasured locations [75]. Research has substantially improved AE-based HRTF interpolation. These improvements include better handling of position dependencies between sparse inputs and the ability to extrapolate from minimal data. Conditional mechanisms guide the interpolation process based on specific sparse input locations [71, 72]. Some approaches integrate vector quantized variational autoencoders (VQ-VAEs) to learn discrete latent representations of HRTFs, which facilitates more effective subsequent mapping to denser spatial resolutions [73]. Other researchers have focused on optimizing the latent space by encouraging spatial grouping of HRTFs [74]. These efforts improved interpolation accuracy while preserving crucial spectral details, demonstrating the flexibility of the AE framework for capturing complex HRTF structures and supporting efficient HRTF acquisition from minimal measurements.
Convolutional Neural Networks (CNNs). CNNs, known for their ability to extract local features from structured data like images, have been adopted for HRTF processing. By treating HRTF data as image-like inputs [76, 80], CNN architectures effectively perform spatial upsampling. U-Net architectures with an encoder-decoder structure and skip connections prove particularly effective as they capture both local details and global context in HRTF data [76]. Recent CNN-based HRTF interpolation advances focus on incorporating physical constraints and using anthropometric measurements as additional input features. These approaches enhance model robustness and generate more realistic HRTFs, especially when interpolating from very sparse measurements or extrapolating to unmeasured spatial regions [79].
Given the spherical nature of HRTF data, researchers have developed specialized architectures. Spherical CNNs (S-CNNs), for instance, perform convolutions directly on the sphere by defining filters in the SH domain, thereby inherently respecting the data’s geometric structure [77]. This approach enables more efficient learning and better generalization. Building on this concept, the Spherical Convolutional Conditional Neural Process (SConvCNP) [78] leverages S-CNNs within a meta-learning framework specifically for HRTF error interpolation from sparse measurements. This approach allows the model to not only refine HRTF estimates but also adaptively correct biases and provide well-calibrated uncertainty estimates, leading to significantly improved sample efficiency. These specialized architectures demonstrate the advantages of tailoring network designs to the specific geometric properties of HRTF data.
Generative Adversarial Networks (GANs). GANs provide a powerful data-driven approach for HRTF spatial upsampling, using a generator network to produce realistic HRTFs and a discriminator network to distinguish them. One effective strategy converts spherical HRTF data into 2D image-like representations through projection, enabling standard CNN-based GAN architectures such as the Super-Resolution-based GAN (SR-GAN) [80] for upsampling. Other approaches work directly in the SH domain, using specialized GANs to generate high-order coefficients from low-order ones, thereby reconstructing the complete HRTF field [81]. To address the challenge of sparse and potentially noisy input measurements, the HRTF Denoising U-Net (HRTF-DUNet) [82] combines a U-Net-based denoiser with an autoencoding GAN (AE-GAN). This approach effectively upsamples HRTFs even from highly sparse measurements.
II-D2 Continuous Spatial Representation
Implicit neural representations (INRs) [90, 91] offer a compelling alternative to discrete-domain methods by overcoming the limitations of fixed spatial grids. INRs use a neural network to learn a continuous mapping from spatial coordinates, typically azimuth angles and elevation angles , to the corresponding HRTF complex values, denoted as :
[TABLE]
This approach represents the HRTF as a continuous differentiable function that can be queried at any arbitrary spatial location. Following success in computer vision for image and shape representation [91, 92, 93] and acoustic field modeling [94, 95], INRs are now being explored for high-fidelity continuous HRTF representation.
Research applying INRs to HRTF modeling has advanced rapidly. Early work demonstrated the potential of conditioning INRs on subject identity [83]. Their DL architecture, incorporating Feature-wise Linear Modulation (FiLM) layers and hyperconvolution, dynamically modulated conditional information to accurately predict HRTFs across different datasets and coordinate systems, effectively capturing individual HRTF spatial patterns. Subsequent studies introduced more advanced architectures like the Implicit Gradient Origin Network (IGON) [84], which demonstrated strong capabilities in learning continuous HRTF representations that generalize across individuals. IGON maps limited HRTF samples to continuous representations by learning spatial distributions and employing improved optimization strategies. This leads to better preservation of spectral details and spatial continuity. For extremely sparse measurement scenarios, Retrieval-Augmented Neural Field (RANF) [88] significantly improves personalized HRTF generation from very few data points by combining database retrieval of relevant HRTF exemplars with neural field learning. This approach creates promising paths for fast, low-cost personalized HRTF acquisition. Neural Steerer [87] further demonstrates how INRs can accurately model array steering vectors by explicitly accounting for important aspects such as inter-channel phase relationships and causality.
To enhance neural field models for HRTFs, researchers have focused on optimization strategies, including prior knowledge integration and improved modeling of key acoustic features. The Neural Infinite Impulse Response Filter Field (NIIRF) [86] incorporates infinite impulse response filter structures, allowing the network to predict filter parameters rather than direct HRTF values. This reduces model size and improves efficiency. The study also found that low-rank adaptation (LoRA) techniques effectively balance model efficiency and personalization performance. Adding physical constraints, like the Helmholtz equation as a training regularization term [96], enhances generalization from sparse data and ensures the physical consistency of the predicted HRTF [85], particularly for high-frequency components. Other efforts focus on improving prediction accuracy for critical perceptual features. For instance, Lu et al. [89] aim to directly predict specific binaural cues such as ITDs and ILDs, producing HRTFs that are both objectively accurate and perceptually convincing.
II-E HRTF Dataset Fusion Strategies
DL approaches to HRTF personalization depend on high-quality datasets. These datasets contain numerous HRIR samples obtained through acoustic measurements on human subjects or dummy heads, or through numerical simulations based on 3D geometry models. Such data serves as the ground-truth necessary for effective model training. Most publicly available HRTF datasets are stored and shared using the Spatially Oriented Format for Acoustics (SOFA)111https://www.sofaconventions.org/ to support distribution and compatibility. Table III provides details of these datasets, all of which follow this standardized format. Despite adopting a common storage convention and growing in number, these datasets show notable differences. Variations arise from different measurement equipment, anechoic environments, microphone types, processing methods, and spatial sampling patterns [107]. These differences make direct comparisons across datasets difficult [108]. More importantly, such diversity limits how well DL models trained on a single dataset can generalize to new situations, creating a significant bottleneck for technological progress in this field.
To address this challenge, INRs offers a promising new approach. By modeling the HRTF as a continuous function of spatial coordinates, INRs naturally accommodate data with different or irregular spatial samplings. This breakthrough enables the integration of data from various sources with different protocols, potentially expanding the scale of effective training data substantially. Effective cross-dataset fusion typically combines coordination at the data level with adaptability at the model level. Researchers are exploring preprocessing techniques, including advanced normalization methods to reduce systematic biases between datasets and provide more consistent input [109]. Additionally, the continuously differentiable property of INRs naturally accommodates HRTF data with different spatial grid structures. In the SONICOM Listener Acoustic Personalization (LAP) Challenge222https://www.sonicom.eu/lap-challenge/, using neural fields to integrate heterogeneous data for HRTF modeling and upsampling has emerged as a viable strategy. Fusion frameworks also support more precise modeling of key perceptual features. Researchers use fused data and INR architectures to learn how to generate important binaural cues while optimizing data preprocessing to enhance specific cue accuracy [89]. This demonstrates that fusion extends beyond increasing data volume to deepening understanding and achieving precise control over acoustic features.
In conclusion, addressing HRTF dataset heterogeneity represents a critical step toward advancing personalized spatial audio. Technologies like INRs enable effective fusion of diverse data, overcoming limitations of the traditional methods. These advances lead to more universal, higher-precision HRTF models and establish a solid foundation for the widespread application of personalized spatial audio.
II-F Evaluation Methodologies for HRTF Modeling
Evaluating DL-based HRTF modeling techniques requires robust assessment methodologies. These methodologies fall into two main categories: subjective perceptual evaluations that directly assess listener experience, and objective evaluations using computational metrics. Objective methods are further divided into signal-based metrics and model-based metrics that use DL to predict perceptual outcomes or simulate human auditory processing.
II-F1 Subjective Perceptual Evaluation
Subjective perceptual evaluation conducted via human listening tests serves as the gold standard for validating the effectiveness of HRTF personalization and the performance of spatial audio systems [2, 23]. These tests aim to measure how well a modeled HRTF reconstructs key perceptual attributes of an individual’s own HRTF, notably sound source localization accuracy, externalization, and timbral naturalness [115, 116].
Sound source localization tasks are fundamental to assessing spatial accuracy. Listeners indicate the perceived direction of sound sources using graphical interfaces or head-pointing. Performance is typically measured using mean absolute error (MAE) or root mean square error (RMSE) in degrees, along with front-back and up-down confusion rates, which provide direct insights into spatial fidelity [22]. Many HRTF personalization studies use these tasks to demonstrate improvements over generic HRTFs [50, 29, 55].
Attribute rating of single stimulus using scales assesses qualitative aspects of the auditory experience for individual stimuli, such as externalization, timbral fidelity, or spatial impression. Listeners typically use Likert or continuous visual analog scales to rate stimuli on specific attributes. The results are often quantified using mean opinion scores (MOS) that reflect overall perceived quality or naturalness [117, 118]. MOS ratings can be adapted to assess different perceptual dimensions within spatial audio synthesis. For example, studies may report a MOS for overall sample quality, a ’similarity MOS’ for likeness to ground-truth, or a ’spatial MOS’ for perceived spatial accuracy.
Comparative evaluation tasks involve direct comparisons between spatial audio stimuli or rendering methods. These tasks help to identify subtle perceptual differences. A basic approach is A/B comparison, where listeners judge differences between a reference and a target stimulus presented alternately. More rigorous methods include forced-choice paradigms such as the two-alternative forced choice (2AFC) test, which requires participants to identify specific stimulus characteristics across intervals [24, 76]. To assess preference or quality, researchers employ paired comparisons where listeners compare stimuli based on given criteria. Furthermore, standardized tests like the Multiple Stimuli with Hidden Reference and Anchor (MUSHRA) test enable comprehensive audio quality assessment, requiring listeners to rate multiple stimuli against a hidden reference on a continuous scale.
II-F2 Objective Evaluation Metrics
Objective metrics provide quantitative, automated, and repeatable assessments that complement subjective evaluations.
Signal-based Metrics. These metrics quantify physical differences between predicted and ground-truth HRTFs (or HRIRs) based on their signal properties without relying on perceptually trained models. Common metrics are summarized in Table IV.
A primary assessment approach involves quantifying spectral differences, which evaluate discrepancies in the frequency domain crucial for localization and timbre. The log-spectral distortion (LSD) is widely used, with lower values indicating closer physical agreement [110]. Other metrics like spectral distance error (SDE) [62], logarithmic relative error (LRE), and logarithmic magnitude distance (LMD) [68, 78], offer alternative ways to quantify spectral deviations with varying sensitivities to power spectrum disparities. Another class evaluates overall signal differences, often in the time domain or complex spectra. RMSE [111, 55] and MAE [112] are common, with MAE being less sensitive to outliers. Metrics assessing signal-to-error ratio, like signal-to-distortion ratio (SDR) [113, 114], quantify the prediction fidelity relative to error magnitude, where higher values indicate better performance. Statistical correlation metrics such as the Pearson correlation coefficient (PCC) [74] measure the linear relationship between features of predicted and target HRTFs, including how well underlying trends are captured.
Model-based Metrics. These metrics use computational models that simulate aspects of human hearing or are trained on perceptual data to predict spatial audio quality. They aim to bridge the gap between signal-based measures and subjective tests.
Auditory models (AMs) simulate key stages of human hearing to predict perceptual aspects of HRTF-spatialized audio, such as localization accuracy or timbre perception [119, 120]. For spatial audio evaluation, these models process binaural signals to estimate how listeners would perceive them [121, 112]. AMs typically employ functional approaches that explicitly model auditory mechanisms, using cue-based analysis and template-matching strategies [122]. The Auditory Modeling Toolbox offers accessible resources for implementing them [119]. Recent AM advances focus on predicting sound quality changes caused by HRTF modifications and modeling complete 3D spatial perception through Bayesian methods [123, 124, 125]. While AMs provide consistent metrics useful for early-stage model development, they simplify the complex hearing pathway. They require careful parameter adjustments, and their validation against human perception remains challenging, especially with regard to higher-level cognitive influences.
Data-driven perceptual predictors use DL to directly predict human ratings of spatial audio quality from acoustic features. These approaches offer scalable alternatives to both traditional metrics and resource-intensive listening tests. Inspired by successes in speech quality assessment where models predict MOS for synthesized speech [126, 127, 128, 129], similar methodologies have emerged for spatial audio evaluation [130, 131, 132, 133]. Notable developments include the Deep Perceptual Spatial Audio Localization Metric (DPLM) [130], which quantifies localization differences using learned embeddings from a direction-of-arrival (DOA) estimation network. The Spatial Audio Quality Assessment Metric (SAQAM) [131] evaluates both listening quality and spatialization quality through multi-task learning. Furthermore, the Spatialization Quality Metric for Binaural Signals (SQM-BS) [132] introduces deep metric learning and multi-task learning to assess the spatialization quality between pairs of binaural signals, designed to be independent of content and duration. Based on SAQAM, the Human Auditory Perception Guided SAQAM (HAPG-SAQAM) [133] incorporates auditory-guided feature extraction and perceptually weighted loss functions for improved alignment with human judgments across various quality dimensions. However, the performance of these data-driven models depends on their training data quality and diversity, from which they can inherit biases [134]. Additionally, their “black box” nature limits the understanding of their decision processes [135], and they may struggle with entirely new conditions [136]. Despite these challenges, ongoing research aims to improve their reliability and interpretability.
III Binaural Audio Synthesis with Deep Learning
DL has expanded spatial audio capabilities, advancing beyond traditional HRTF-based methods discussed in Section II towards direct, end-to-end spatial audio synthesis. This advancement builds on recent breakthroughs in audio generation technologies, including text-to-audio (T2A) [137, 138, 139] and video-to-audio (V2A) [140, 141, 142]. Spatial audio synthesis presents significant challenges that require both sound realism and precise spatial accuracy, aligned with contextual information.
End-to-end neural models excel at this task by implicitly encoding spatial cues without requiring explicit HRTF measurements [37, 38]. This approach improves adaptability and reduces reliance on specific HRTF datasets while achieving a better balance between computational demands and perceptual quality than physics-based simulations. Recent approaches leverage advanced architectures, notably including U-Nets [143], Transformers [144], and diffusion models [145]. Combined with self-supervised and multi-task learning [146] techniques, these methods demonstrate enhanced audio quality, better generalization, and potential for real-time applications.
III-A Synthesis Paradigms Based on Input Modalities
DL-based spatial audio synthesis can be categorized by the main input modalities that guide the generation process. As shown in Figure 3, these approaches fall into single-modal or multi-modal categories.
In single-modal synthesis, shown in Figure 3(a), the main task transforms an input audio signal into the binaural format. The conditioning information consists of explicit spatial parameters such as source/listener position and orientation. The synthesis task can be formulated as:
[TABLE]
Multi-modal synthesis, as illustrated in Figure 3(b), enhances this approach by adding non-auditory information alongside the source audio (if present). This provides richer contextual cues for spatialization. The general formulation is:
[TABLE]
where might sometimes be absent if the task is to generate all sound from non-auditory cues. Key multi-modal approaches, differentiated by the nature of the non-auditory information , include:
- •
Visual-guided Synthesis: In this approach, the non-auditory input is visual data. This can range from 2D images and videos that inform sound source properties and locations [147, 148, 149, 150], to detailed 3D scene geometry (e.g., point clouds, meshes) that helps model environmental acoustics and precise spatial relationships [151, 152, 153, 154].
- •
Text-guided Synthesis: Here, the non-auditory input consists of natural language descriptions. These textual prompts are used to control various spatial audio characteristics during generation, such as the type of acoustic environment, source positions, or sound event semantics [155, 156, 157].
- •
Joint Multi-modal Guided Synthesis: This category leverages a combination of non-auditory modalities, for instance, it can integrate both visual information and textual descriptions to achieve more comprehensive and nuanced control over the spatial audio synthesis [158, 159, 160, 161].
The following sections explore these single-modal and multi-modal synthesis approaches in detail, highlighting key methodologies, architectural innovations, and recent advancements in the field.
III-B Single-modal Synthesis: Spatialization from Audio
Single-modal spatial audio synthesis focuses on generating spatialized audio primarily from audio signals, with spatial parameters such as source and listener position or orientation are typically provided as conditioning metadata. This approach transforms monaural audio inputs into immersive binaural audio by learning the acoustic filtering that occurs as sound interacts with the listener and their environment. Key methods are summarized in Table V.
Early research established the feasibility of using DL for this application. A Temporal Convolutional Network (TCN) was shown to be capable of directly synthesizing binaural audio in reverberant environments, with performance matching that of traditional HRTF filtering [37]. Building on this foundation, the Warping Network (WarpNet) introduced architectural and loss functions improvements to produce realistic and spatially accurate binaural sound in real-time [38]. These initial studies confirmed that DL models could learn the intricate mono-to-binaural transformation without requiring explicit HRTF data for every scenario.
Subsequent research focused on enhancing synthesis quality through more powerful generative models. A modified vector-quantized variational autoencoder (VQ-VAE) was developed for speech binauralization, designed to accurately reproduce environmental factors such as background noise and reverberation [162]. Diffusion models led to significant quality improvement, especially in phase spectrum estimation. BinauralGrad [163], for example, employed a two-stage framework that used diffusion models to synthesize the common and specific parts of the binaural audio separately. More recently, DIFFBAS [164] incorporated perceptually motivated interaural phase difference (IPD) losses directly into the diffusion process, which substantially improved realism.
Researchers have also addressed the challenge of rendering dynamic scenes and improving computational efficiency [165, 166, 167]. For the synthesis of moving sound sources, the Dual Position Attention Time-Frequency Network (DPATFNet) [166] uses attention mechanisms to track sound source movement and improve phase estimation. Similarly, Zhang et al. [167] proposed a two-stage framework with a position-orientation self-attention (POSA) module to integrate spatial information and capture source motion. To reduce computational demands, Neural Fourier Synthesis (NFS) [168] achieved significant reductions in model size and inference time by performing synthesis in the frequency domain, predicting the delays and scales of early reflections based on geometric time delays. These advances demonstrate the growing ability of DL to capture subtle acoustic details that are crucial for convincing spatialization in dynamic environments.
A persistent challenge in supervised mono-to-binaural synthesis is the need for extensive paired monaural and binaural recordings that are costly to acquire. Zero-shot learning approaches offer a promising solution to this data limitation. ZeroBAS [169] represents a pioneering effort in this direction, successfully synthesizing binaural audio without paired training data by combining geometric time warping (GTW) techniques with pre-trained generative audio models. This line of research shows potential for developing more adaptable and personalized binaural synthesis systems with reduced data requirements.
III-C Multi-modal Guided Spatial Audio Synthesis
Researchers are exploring multi-modal guided binaural audio synthesis to improve realism, accuracy, and interactive control beyond what audio and spatial parameters alone can achieve. This approach enhances sound reproduction by combining non-auditory information with source audio signals to guide spatial audio generation. Visual and textual cues are the primary additional inputs in this process. Table VI summarizes key methods, organized by their main guiding input modalities.
III-C1 Visual-guided Synthesis
Visual information from static images, videos, or 3D scene representations provides valuable cues about sound source characteristics, scene layout, and acoustic properties. These visual cues can enhance the spatial accuracy, environmental realism, and scene consistency of synthesized binaural audio. The field has evolved from basic fusion techniques to sophisticated modeling of audiovisual interactions and environmental acoustics.
Early research demonstrated the effectiveness of combining visual and audio features. Mono2binaural [148] used a CNN to extract global visual features from video frames, which were then combined with audio features to guide the synthesis. To address the limited availability of annotated binaural audiovisual datasets, self-supervised learning became essential. Morgado et al. [147] developed a self-supervised method for learning audiovisual spatial correspondence from video. Similarly, the Audio Spatialization Network (ASN) [170] employed self-supervision with an auxiliary classifier to learn spatial information implicitly.
Attention mechanisms later became vital for precise audiovisual association and feature fusion. These techniques allow models to focus on visual regions most relevant to current audio events, improving accuracy in complex scenes [175, 173]. The Multi-Attention Fusion Network (MAFNet) [175] used both self-attention within the visual modality and cross-modal attention to selectively integrate relevant visual features with audio. Similarly, Li et al. [172] applied attention mechanisms to effectively combine visual and audio features, while the Localize-to-Binauralize Network (L2BNet) [173] incorporated attention modules to strengthen the connections between visual cues and inferred source locations prior to synthesis.
Multi-task learning emerged as another effective strategy for enhancing model understanding. By jointly optimizing binaural synthesis with auxiliary tasks such as sound source separation [149, 171], models can develop a more comprehensive understanding of sound sources and their spatial layout. Further refinements included flipped audio classification [172] to encourage consistent spatial representations and left-right consistency enforcement between audio and visual modalities [174] to align the generated audio with the visual content both semantically and spatially.
The incorporation of 3D scene structure marked a significant step toward greater realism. Initial efforts used depth maps as additional input, as seen in Bmonobinaural [176], which leveraged depth as a proxy for distance information. Other approaches utilized explicit 3D geometric representations such as point clouds; for example, Points2Sound [177] employed 3D sparse convolutional networks to process such representations. Geometric constraints, such as enforcing spatial consistency between audiovisual streams [178], helped to refine the synthesis. To reduce reliance on paired binaural data, PseudoBinaural [171] used only visual information with HRIR models. Advanced self-supervised techniques like contrastive learning enhanced audiovisual representations; for instance, Contextual and Contrastive Stereophonic Learning (CCStereo) [181] improved spatial sensitivity through negative-sample mining from shuffled visual features.
Recent trends show the integration of these advancements with powerful generative models and complex cross-modal frameworks. The Cyclic Locating-and-UPmixing (CLUP) model [150] enables visual sound object localization and binaural generation to enhance each other through cyclic learning. The Stereo Audio Generation Model (SAGM) [180] uses shared spatio-temporal visual information to guide both generator and discriminator components in a GAN. Liu et al. [179] proposed generating the left and right audio channels separately with visual guidance and introduced a cross-modal matching loss to explore audiovisual correlations. For video, OmniAudio [182] uses a Transformer-based dual-branch architecture with self-supervised pre-training to process the complete visual context.
A distinct research direction focuses on modeling environment acoustics using 3D scene geometry derived from visual input. This approach aims for physical realism by simulating sound-environment interactions, going beyond research solely focused on RIR estimation [94, 189, 95, 190, 191]. These methods integrate environmental acoustic modeling directly into the spatial audio synthesis pipeline. A key technical approach is to adapt advanced 3D scene representation techniques, such as neural radiance fields (NeRF) and 3D gaussian splatting (GS). These methods excel at learning detailed 3D geometry from multi-view images, which then inform acoustic propagation models to render spatial audio with environment-specific effects. NeRF-based methods [151, 183] explore the use of density fields for acoustic rendering. Methods based on GS, including AV-GS [152], AV-Cloud [184], Scene Occlusion-aware Acoustic Field (SOAF) [153], and AV-Surfs [154], leverage GS representations for acoustic simulations. Notably, AV-Surfs [154] also estimates surface properties to determine acoustic materials, enabling more physically accurate environmental sound rendering.
III-C2 Text-guided Synthesis
Text-driven spatial audio synthesis offers a more flexible control method compared to visual approaches. It allows users to specify desired sound field characteristics through natural language descriptions, reducing creation barriers and enabling more personalized audio experiences. The main challenge lies in translating unstructured language into structured spatial parameters or effective conditions for generative audio models.
This research area is emerging but shows significant potential. Methods typically follow a two-step process: First, using Natural Language Processing (NLP) techniques, especially Large Language Models (LLMs) or specialized semantic parsing, to extract key spatial information such as source type, location, motion, and environment from text descriptions [192, 193, 194]; second, using this parsed structured information to guide spatial audio synthesis models.
While traditional parametric renderers can work with such structured information, recent research increasingly uses deep generative models for improved synthesis quality and flexibility. Text-guided Audio Spatialization (TAS) [155] demonstrated the conversion monaural audio into spatial audio based on text prompts, offering an adaptable alternative to audiovisual methods. Similarly, SpatialTAS [186] employed a latent diffusion model conditioned by text embeddings to achieve flexible audio spatialization, allowing control over source direction, distance, and relative positions. AudioSpa [157] applied LLMs to process both acoustic and textual information, using fusion multi-head attention to integrate text tokens and enhance multi-modal learning capabilities. DualSpec [156] implemented conditional diffusion models that generate high-quality, spatially controllable binaural audio directly from text descriptions.
III-C3 Joint Multi-modal Guided Synthesis
Achieving comprehensive, robust, and interactive spatial audio synthesis requires a framework that can effectively combine multiple modalities, including the audio content itself, visual scene information, textual commands, and potential user interactions. Research in this area explores deep cross-modal learning and advanced generative models capable of handling diverse inputs.
Several recent works demonstrate this trend through advanced generative approaches. SpatialSonic [160], pre-trained on large-scale simulated data, shows how diffusion models accept multi-modal conditions for flexible spatial audio generation. SEE-2-SOUND [159] aims for zero-shot visual-to-spatial audio mapping, generating plausible spatial sound for novel visual scenes without specific paired training data, requiring strong model generalization capabilities.
Application-focused research is driving further integration across modalities. ImmerseDiffusion [158] combines spatial, temporal, and environmental conditions within a Diffusion Transformer (DiT) model to generate immersive speech streams for communication contexts. Diff-SAGe [187] generates first-order ambisonics conditioned on sound category and sound location, targeting applications that benefit from standardized ambisonic formats. For silent video applications, Video-to-Spatial Audio Generation (ViSAGe) [161] produces first-order ambisonics by using Contrastive Language-Image Pre-Training (CLIP) visual features and an autoregressive neural audio codec model that incorporates both directional and visual guidance. In creative applications like spatial drama generation, Immersive Spatial Drama (ISDrama) [188] uses rich multi-modal prompts including scripts, video, and character poses to guide a Mamba-Transformer model. This approach includes specific mechanisms for unified pose encoding to address motion effects and aims for detailed prosody control in generated spatial dialogue.
III-D Datasets for Binaural Synthesis
The advancement of DL-based binaural audio synthesis depends on suitable training datasets, with requirements that vary according to the task specifications. Table VII summarizes key public datasets available to researchers.
III-D1 Audio-driven Synthesis Datasets
This category primarily encompasses tasks where monaural audio serves as the input for synthesizing binaural audio. High-quality synchronized monaural-binaural audio pairs from diverse acoustic conditions are essential for training robust models. However, publicly available datasets remain somewhat limited. The Binaural Speech dataset [38] represents an early contribution, containing approximately two hours of high-quality, real-world recordings of dynamic dialogues with head-tracking information. However, its limited scale constrains model performance across varied scenarios. To address this data scarcity, researchers have resorted to re-recording existing audio corpora under controlled binaural conditions [162, 203] or generating synthetic data through acoustic simulation. It is worth noting, however, that such synthetic datasets are not typically made public.
III-D2 Multi-modal Guided Synthesis Datasets
These tasks require datasets that feature synchronized audio alongside additional guiding modalities such as visual content, textual descriptions, or explicit spatial information.
Visual-guided Datasets. Synthesis methods that leverage video or image content represent a significant area of research. Foundational work utilized real-world video datasets including REC-Street, YT-Clean, and YT-Music [147], often for self-supervised learning approaches that paired visual content with spatial or binaural audio. Music-focused applications benefit from specialized datasets such as FAIR-Play [148] and MUSIC-Stereo [171]. To overcome limitations of real-world data availability, researchers have developed large-scale synthetic or processed datasets. SimBinaural [178] provides paired video, binaural audio, and RIRs from simulated 3D indoor scenes. The YouTube-Binaural dataset [178] extends accessibility by converting ambisonic audio from real YouTube videos into pseudo-binaural labels for training.
Text-guided Datasets. Text-guided synthesis represents an emerging area that often necessitates custom dataset creation. The Text-guided Audio Spatialization Benchmark (TASBench) dataset [155] establishes the text-guided audio spatialization task, featuring dense, frame-level text annotations for evaluating fine-grained control capabilities; this dataset is not open source. The large-scale Both Ears Wide Open 1M (BEWO-1M) dataset [160] provides spatial audio paired with detailed textual descriptions and optional images, supporting broader text and image-to-spatial audio tasks.
Environment-related Datasets. Research incorporating 3D geometry or environment acoustics requires datasets with richer contextual information. The SoundSpaces dataset [195, 196] offers navigable 3D environments derived from scans such as Matterport3D [204], providing binaural RIRs for early audio-visual navigation studies. For real-world applications, the Real-world Audio-Visual Scene (RWAVS) dataset [151] features scenes captured by a moving camera, including video, binaural, and source audio, along with pose data for models requiring geometric awareness. The Replay dataset [198] provides over 4000 minutes of real indoor social interactions with multi-view video and multi-channel or binaural audio recordings. Additional relevant datasets support acoustic modeling research, including the Geometric Wave Acoustic (GWA) dataset [197], SoundCam [199], Real Acoustic Fields (RAF) [200], Real-recorded and Annotated Microphone Array Speech & Noise (RealMAN) [201], and SonicSet [202]. These resources, while less commonly used in the reviewed synthesis methods, offer valuable data for specialized audio processing tasks.
III-E Evaluation Methodologies for Binaural Audio Synthesis
Evaluating DL-based binaural audio synthesis requires a comprehensive approach that assesses both signal fidelity, spatial accuracy, and the overall perceptual quality of the generated sound field. Evaluation methodologies include subjective listening tests and objective metrics; the latter category encompasses both signal-based measures and model-based approaches that predict perceptual attributes.
III-E1 Subjective Listening Tests
Subjective listening tests remain the gold standard for evaluating the perceptual quality of synthesized binaural audio. The fundamental methodologies for conducting these assessments were detailed in Section II-F.
These tests are designed to evaluate key aspects of the synthesized listening experience [166, 38]. Listeners typically compare synthesized audio with ground-truth signals, providing ratings on dimensions such as overall audio quality, spatial realism, timbral naturalness, and immersion. Mean Opinion Score (MOS) ratings are used frequently in these evaluations and can be adapted to address specific attributes relevant to the synthesis task, such as ’spatial MOS’ or ’comparison MOS’. For multi-modal synthesis, the evaluation must also assess the coherence and plausibility of the synthesized audio relative to the guiding non-auditory cues [155], as well as the overall perceived realism of the combined multi-modal scene.
III-E2 Objective Metrics for Synthesized Audio
Objective metrics quantify differences between synthesized and ground-truth signals or assess adherence to acoustic and spatial goals. These metrics vary based on the synthesis task and fall into signal-based measures and model-based perceptual predictors. Table VIII presents common metrics used in this field.
A primary category of metrics assesses waveform and spectral similarity. Direct time-domain comparisons often use the L2 distance between waveforms (WAV) [38, 163]. In the frequency domain, the short-time Fourier transform (STFT) distance [148] measures the overall dissimilarity in time-frequency representation. More specifically, magnitude spectra distance (Mag/MAG) [171] and phase spectra distance (Phs) [171] are often evaluated separately, as these components contribute differently to perception. The envelope distance (ENV) [147] assesses structural similarity over time by comparing temporal energy contours, which relate to perceived dynamics and transients. Metrics evaluating spatial accuracy are critical for binaural synthesis. The left-right energy ratio error (LRE) [205] measures discrepancies in the energy balance between channels, which relates to perceived source width or lateral balance. Errors in key binaural cues, such as ITD and ILD (ITD/ILD Error) [206], provide direct measures of how accurately fundamental localization cues are reproduced. For generative models that lack a direct ground-truth comparison, metrics like Fréchet-audio distance (FAD) [207] are used. FAD compares the statistical distributions of embeddings from synthesized audio and a set of real target examples to assess overall perceptual realism and diversity.
IV Applications and Impact of DL-powered Spatial Audio
Spatial sound provides rich directional, distance, and environmental information, making it a crucial sensory modality. Deep learning (DL) techniques have significantly improved the generation, processing, and integration of spatial audio into computational systems, enhancing performance across numerous applications. These advances enable spatial audio to serve both as a synthesis target for immersive experiences and as an essential input for intelligent systems focused on perception, interaction, and environment understanding. This section explores these advancements through two main categories: applications that directly enhance human experience and interaction, and those that enable intelligent systems and environmental understanding.
IV-A Enhancing Human Experience and Interaction
DL-driven spatial audio enhances applications focused on human perception, communication, and immersion by creating more realistic and engaging sound environments.
IV-A1 Virtual and Augmented Reality
High-quality spatial audio is essential for creating immersion and presence in VR/AR [208, 209]. DL-powered synthesis contributes significantly to both the “place illusion” (the feeling of being there) and the “plausibility illusion” (the feeling that the scenario is real), making virtual experiences more authentic [210, 208]. Beyond realism, it serves as an attention guidance mechanism, directing users toward important events outside their limited field-of-view [211, 212]. It provides interaction feedback, confirms user actions, and improves social presence in multi-user applications by facilitating speaker localization and identification [18]. In AR applications, DL-based approaches are critical for integrating virtual sounds with real environments by correctly positioning them relative to physical objects [213].
IV-A2 Hearing Aids and Assistive Technologies
For hearing accessibility, DL-driven spatial audio processing offers powerful new solutions. Modern hearing aids can incorporate DL models that leverage spatial cues and perform advanced directional filtering - based on principles like the “cocktail party effect” [214] - to improve speech clarity in noisy settings. These systems enhance comprehension by spatially separating speech from background noise, potentially restoring spatial hearing abilities for those with hearing loss [215]. Furthermore, DL-based synthesis enables the creation of realistic virtual auditory environments for audiological assessment and rehabilitation [216, 217, 218], allowing clinicians to create complex listening scenarios for testing and therapy.
IV-A3 Telepresence and Enhanced Communication
In teleconferencing, spatial audio improves speech intelligibility by spatially separating speakers, thus reducing listening effort. This approach mimics real-world conversation dynamics, helping listeners focus on specific speakers in multi-talker scenarios. It enhances shared presence in virtual meetings by accurately representing participant locations within a common virtual acoustic space. This contributes to natural turn-taking cues and a stronger sense of co-location, particularly valuable in VR collaboration platforms [208] and hybrid meetings that bridge remote and physically present participants.
IV-B Enabling Intelligent Systems and Environmental Understanding
Beyond human experience, spatial audio cues interpreted by DL models equip intelligent systems with improved capabilities for perception, navigation, and interaction with the physical world. These advances also offer valuable tools for scientific research and design.
IV-B1 Audio-Visual Navigation and Robotics
Spatial sound guides agents or robots through environments, especially toward sound-emitting objects. Current research enables agents to navigate using combined audio-visual inputs. Key challenges that are being addressed with DL include locating static or dynamic sound sources [219, 220], understanding sound semantic context [221, 222], maintaining robustness amid distractors or noise [223], and bridging the simulation-to-reality gap [224]. Effective approaches incorporate multi-modal fusion, attention mechanisms, and reinforcement learning [225] to utilize directional audio cues for localization and path planning.
IV-B2 Acoustic Scene Understanding and Depth Estimation
Sound reflections, echoes, and source properties contain valuable geometric and semantic information about surrounding spaces. Studies show that spatial audio improves visual depth estimation in unclear regions and reveals properties of areas outside direct view [226, 227, 228]. DL models employ cross-modal fusion techniques to combine audio spatial cues with visual information, creating more accurate 3D scene reconstruction [229]. Spatial sound also enables audio-based semantic segmentation of environments [230, 231].
IV-B3 Cross-modal Representation Learning for Perception
The physical relationship between sound propagation and spatial environments makes spatial audio an effective supervisory signal for learning robust representations across modalities. Models inspired by echolocation can use sound to develop spatial representations from visual data [232]. Training advanced perception models relies on the creation of spatially consistent audio-visual data, a task for which DL-based synthesis is well-suited [233]. Additionally, modeling how physical bodies affect sound fields helps create accurate virtual representations and improves understanding of acoustic interactions [234].
V Challenges and Future Directions
DL has significantly advanced spatial audio reproduction technology, demonstrating great potential in HRTF personalization, binaural audio synthesis, and related applications. However, several key challenges remain unresolved. This section summarizes current research bottlenecks and explores future development trends.
V-A Data Availability and Diversity
The performance of data-driven DL methods is fundamentally limited by the quality and quantity of training data. Acquiring suitable datasets remains a major challenge in spatial audio reproduction research. High-quality HRTF measurement demands specialized equipment and controlled acoustic environments. Consequently, public datasets are limited in number and exhibit significant variations in measurement conditions and spatial sampling protocols. These inconsistencies affect cross-study comparability and model generalization capabilities [107, 108]. Binaural audio synthesis faces even greater challenges in obtaining large-scale real-world paired data [200]. This challenge grows more pronounced for complex scenarios with dynamic interactions and multi-modal inputs that need precise timing synchronization and careful annotation.
Future research should focus on developing larger-scale, more diverse datasets that follow standardized protocols and remain openly accessible. Effective techniques for combining heterogeneous data, as discussed in Section II-E, are essential, with INRs showing promise for handling irregular sampling patterns. Improved data normalization methods are needed to reduce biases between different datasets. While physical acoustic simulations offer a valuable tool for augmenting training data [32, 202], the field must also address the simulation-to-reality gap to ensure models trained on synthetic data perform robustly in the real world [46]. Furthermore, research into less-supervised learning paradigms - including self-supervised, pseudo-supervised and zero-shot learning – is crucial for reducing the reliance on costly annotated data. As data privacy concerns grow, distributed training frameworks like federated learning also warrant exploration [235, 236].
V-B Perceptual Validity and Evaluation
A critical limitation of current research is the difficulty in accurately evaluating the perceptual effect of synthesized spatial audio. Most objective evaluation metrics reflect signal-level similarity but correlate poorly with human auditory perception [237, 238, 239], particularly for complex attributes like immersion and externalization. This disconnect can lead to model optimization that diverges from the actual user experience. While subjective listening experiments provide the most reliable perceptual assessment, they are resource-intensive, often lack standardization, and are subject to significant inter-listener variability. Auditory models (AMs) offer a promising alternative by predicting perceptual outcomes like localization performance, but current models require improved accuracy, generalization, and coverage of complex perceptual phenomena [119].
Future work requires methodological breakthroughs in evaluation. Developing objective metrics that correlate strongly with key dimensions of auditory perception is essential. This might involve integrating more sophisticated psychoacoustic principles or using DL to directly predict subjective ratings from audio signals [132, 131]. The field would benefit from standardized subjective evaluation protocols and more efficient assessment methods, such as online crowdsourcing platforms or immersive virtual environments [240]. Enhanced AMs that can simulate individual differences and perception in complex multi-modal scenarios represent another important research direction [121, 123]. Finally, incorporating explainable artificial intelligence (XAI) techniques can help clarify relationships between a model’s internal behavior and its perceptual outcomes [241].
V-C Generalization Ability and Robustness
DL models often exhibit decreased performance when deployed in acoustic environments that differ from their training data. This issue is a major barrier to the practical deployment of spatial audio technologies. Models must generalize not only to new users and acoustic conditions but also handle reverberation, ambient noise, diverse sound sources, and variations in playback devices. They should also respond effectively to dynamic changes, such as head and source movements.
Addressing the gap between training data and real-world applications will likely require a combination of domain adaptation and transfer learning techniques [242, 243]. Well-designed data augmentation strategies help simulate a wider range of real-world conditions during training. Robust optimization methods, like adversarial training, can enhance model stability [244, 245]. A particularly promising direction is the integration of physical acoustic laws as prior knowledge; physics-informed neural networks (PINNs) [246, 247] have demonstrated potential in HRTF modeling for improved physical realism and generalization from sparse data [85, 248, 249, 250, 251, 252, 253]. Research on continual learning and online adaptation could enable models to adjust their parameters after deployment based on user feedback or changing environmental acoustics. Neural field models offer efficient representation and rendering of dynamic acoustic scenes [94, 183].
V-D Controllability, Interpretability, and Interactivity
Many advanced DL models, particularly end-to-end generative models, operate as ”black boxes”. Their internal mechanisms lack transparency, and users have limited means of exercising detailed control over the generated output. These limitations restrict their application in creative applications, complicate debugging, and hinder the development of truly personalized interaction systems.
XAI techniques are essential for revealing the decision-making processes of these models. Methods such as feature attribution and concept analysis can help diagnose model behaviour by examining how key acoustic features are represented internally [254, 241]. Designing modular or structured generative architectures that disentangle different auditory attributes could enable more precise user control [255, 145]. More intuitive user interfaces - ones that allow interaction through speech, text, gestures, or physiological signals - are also necessary to bridge the gap between user intent and model output. Advanced conditional generative models are needed to respond to higher-level semantic commands, enabling truly personalized and creative spatial audio [155, 156, 160].
V-E Computational Efficiency and Real-time Capability
High-performance DL models, including large-scale Transformers and diffusion models, often have substantial computational resources and memory requirements. This conflicts with the need for real-time processing and energy efficiency, which is particularly critical for deployment on mobile devices, VR/AR headsets, and hearing aids.
Model compression techniques, including knowledge distillation [256], network pruning, and quantization [257], are effective for reducing deployment requirements. Designing lightweight network architectures optimized for the specific characteristics of audio signals has also shown significant promise [168]. For generative models, improved sampling algorithms remain essential for real-time performance, especially for iterative models like diffusion models [258, 259]. While specialized hardware accelerators like graphics processing units (GPUs) can enhance processing speed, future advances will likely depend on hardware-software co-design to create high-performance, real-time spatial audio systems that balance computational costs with perceptual quality.
VI Conclusion
This survey has reviewed recent advances in the application of Deep Learning (DL) to spatial audio reproduction, with a particular focus on personalized binaural techniques. Our analysis demonstrates that DL is not merely an incremental improvement but is fundamentally reshaping core spatial audio technologies. In particular, DL has transformed HRTF modeling by enabling data-driven personalization at a scale previously unattainable. Simultaneously, significant progress in end-to-end binaural audio synthesis has facilitated robust spatial cue recovery and the sophisticated integration of multi-modal information. These technological advancements have a profound dual impact: they are creating more immersive and interactive environments for human listeners while also empowering intelligent systems with a more sophisticated understanding of the acoustical world. Despite this considerable progress, critical challenges remain in data availability, perceptual evaluation, and model performance. Addressing these bottlenecks is vital for the next generation of research, which will push spatial audio technologies towards greater realism, personalization, and accessibility.
Acknowledgments
This work is funded by the Special Fund for Talent Development in AI-Enabled Interdisciplinary Programs at East China Normal University (2024JCRC-08) and the National Natural Science Foundation of China (12411530075).
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1[1] B. C. Moore, An introduction to the psychology of hearing . Leiden: Brill, 2012.
- 2[2] J. Blauert, Spatial hearing: the psychophysics of human sound localization . Cambridge: MIT press, 1997.
- 3[3] C. Rajguru, M. Obrist, and G. Memoli, “Spatial soundscapes and virtual worlds: Challenges and opportunities,” Frontiers in Psychology , vol. 11, p. 569056, 2020.
- 4[4] B. Xie, “Spatial sound-history, principle, progress and challenge,” Chinese Journal of Electronics , vol. 29, no. 3, pp. 397–416, 2020.
- 5[5] T. Wendt, S. Van De Par, and S. D. Ewert, “A computationally-efficient and perceptually-plausible algorithm for binaural room impulse response simulation,” Journal of the Audio Engineering Society , vol. 62, no. 11, pp. 748–766, 2014.
- 6[6] C. Kirsch and S. D. Ewert, “Low-order filter approximation of diffraction for virtual acoustics,” in 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) . IEEE, 2021, pp. 341–345.
- 7[7] S. D. Ewert, “A filter representation of diffraction at infinite and finite wedges,” JASA Express Letters , vol. 2, no. 9, 2022.
- 8[8] C. Kirsch and S. D. Ewert, “A universal filter approximation of edge diffraction for geometrical acoustics,” IEEE/ACM Transactions on Audio, Speech, and Language Processing , vol. 31, pp. 1636–1651, 2023.
