SIREN: Spatially-Informed Reconstruction of Binaural Audio with Vision
Mingyeong Song, Seoyeon Ko, Junhyug Noh

TL;DR
SIREN is a visually guided framework that converts the monaural audio track of a video into binaural audio, using learned left/right attention and confidence-weighted waveform fusion. It requires no task-specific annotations.
Contribution
SIREN introduces a visually guided mono-to-binaural method that explicitly predicts left and right channels. It combines a ViT-based encoder with dual-head self-attention, a soft annealed spatial prior, and two-stage confidence-weighted waveform fusion.
Findings
SIREN yields consistent gains on the FAIR-Play and MUSIC-Stereo datasets.
The gains appear on time-frequency and phase-sensitive metrics, with competitive SNR.
The design is modular and generic, and integrates with standard pipelines.
Abstract
Binaural audio delivers spatial cues essential for immersion, yet most consumer videos are monaural due to capture constraints. We introduce SIREN, a visually guided mono-to-binaural framework that explicitly predicts left and right channels. A ViT-based encoder learns dual-head self-attention to produce a shared scene map and end-to-end L/R attention, replacing hand-crafted masks. A soft, annealed spatial prior gently biases early L/R grounding, and a two-stage, confidence-weighted waveform-domain fusion (guided by mono reconstruction and interaural phase consistency) suppresses crosstalk when aggregating multi-crop and overlapping windows. Evaluated on FAIR-Play and MUSIC-Stereo, SIREN yields consistent gains on time-frequency and phase-sensitive metrics with competitive SNR. The design is modular and generic, requires no task-specific annotations, and integrates with standard…
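The confidence-weighted fusion of overlapping windows can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `mono_confidence` and `fuse_windows`, the assumption that the mono signal equals (L + R) / 2, and the specific confidence formula are all hypothetical choices made for illustration. The sketch covers only a single mono-reconstruction stage and omits the interaural phase consistency term and the multi-crop aggregation described in the abstract.

```python
import numpy as np

def mono_confidence(left, right, mono, eps=1e-8):
    """Hypothetical confidence in (0, 1]: close to 1 when (left + right) / 2
    reconstructs the mono segment, lower as the relative error grows."""
    err = np.mean((0.5 * (left + right) - mono) ** 2)
    ref = np.mean(mono ** 2) + eps
    return 1.0 / (1.0 + err / ref)

def fuse_windows(windows, starts, mono, eps=1e-8):
    """Fuse overlapping L/R window predictions in the waveform domain.

    windows: list of arrays shaped (2, n), rows are left and right channels.
    starts:  sample offset of each window within the full mono signal.
    mono:    full mono waveform, shape (total,).
    """
    total = mono.shape[0]
    acc = np.zeros((2, total))
    weight = np.zeros(total)
    for pred, start in zip(windows, starts):
        n = pred.shape[1]
        c = mono_confidence(pred[0], pred[1], mono[start:start + n])
        acc[:, start:start + n] += c * pred   # confidence-weighted sum
        weight[start:start + n] += c
    # Samples not covered by any window fall back to the mono signal.
    out = np.tile(mono.astype(float), (2, 1))
    covered = weight > eps
    out[:, covered] = acc[:, covered] / weight[covered]
    return out

# Two overlapping windows: one consistent with the mono input, one not.
mono = np.ones(8)
good = np.ones((2, 6))    # reconstructs mono exactly
bad = np.zeros((2, 6))    # inconsistent with mono
fused = fuse_windows([good, bad], [0, 2], mono)
```

In the overlap region the fused signal sits closer to the window that reconstructs the mono input, which is the intuition behind down-weighting unreliable windows to limit crosstalk.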
