SIRUP: A diffusion-based virtual upmixer of steering vectors for highly-directive spatialization with first-order ambisonics

Emilio Picard (RIKEN AIP; UP1 EMS); Diego Di Carlo (RIKEN AIP; IP Paris); Aditya Arie Nugraha (RIKEN AIP); Mathieu Fontaine (LTCI; IP Paris); Kazuyoshi Yoshii (RIKEN AIP)

arXiv:2602.17732 · eess.AS · February 23, 2026

Open Access

TL;DR

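SIRUP is a latent diffusion model that virtually upmixes steering vectors from first-order ambisonics (FOA) to higher-order ambisonics (HOA): a variational autoencoder learns a compact latent encoding of the HOA data, and a diffusion model conditioned on the FOA data generates the HOA embeddings. It is reported to outperform traditional methods in source localization and speech denoising.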

Contribution

The paper proposes a novel latent diffusion model with a VAE for virtual upmixing of steering vectors from FOA data, addressing the mutual dependency between the spatial directivity of source estimation and the spatial resolution of the FOA data.

Findings

01 Significant improvement in steering vector upmixing accuracy

02 Enhanced source localization capabilities

03 Better speech denoising performance

Abstract

This paper presents virtual upmixing of steering vectors captured by a spherical microphone array with few channels. This challenge has conventionally been addressed by recovering the directions and signals of sound sources from first-order ambisonics (FOA) data, and then rendering the higher-order ambisonics (HOA) data using a physics-based acoustic simulator. This approach, however, struggles to handle the mutual dependency between the spatial directivity of source estimation and the spatial resolution of the FOA data. Our method, named SIRUP, employs a latent diffusion model architecture. Specifically, a variational autoencoder (VAE) is used to learn a compact encoding of the HOA data in a latent space, and a diffusion model is then trained to generate the HOA embeddings, conditioned on the FOA data. Experimental results showed that SIRUP achieved a significant improvement compared…
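
The latent diffusion idea in the abstract can be illustrated with a minimal sketch. It assumes a DDPM-style forward process with a linear noise schedule; the abstract only says "diffusion model", so this formulation, the schedule values, the latent and conditioning vectors, and every function name below are hypothetical placeholders rather than the authors' implementation. The sketch shows how a clean HOA latent (as a VAE encoder might produce) is noised, and how the noisy latent, the timestep, and FOA conditioning features could be packed together as input to a denoiser.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bars = []
    prod = 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars


def add_noise(z0, alpha_bar, rng):
    """Sample z_t = sqrt(alpha_bar) * z0 + sqrt(1 - alpha_bar) * eps; return (z_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    z_t = [a * z + s * e for z, e in zip(z0, eps)]
    return z_t, eps


def recover_z0(z_t, eps_hat, alpha_bar):
    """Invert the forward step given a noise estimate eps_hat."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(z - s * e) / a for z, e in zip(z_t, eps_hat)]


def denoiser_input(z_t, t, foa_condition):
    """Hypothetical input layout: noisy latent, timestep, FOA conditioning."""
    return list(z_t) + [float(t)] + list(foa_condition)


rng = random.Random(0)
alpha_bars = alpha_bar_schedule(1000)

z0 = [0.5, -1.0, 0.25, 2.0]   # placeholder HOA latent (stand-in for a VAE encoder output)
foa = [1.0, 0.0, 0.0, 1.0]    # placeholder FOA conditioning features
t = 500

z_t, eps = add_noise(z0, alpha_bars[t], rng)
x = denoiser_input(z_t, t, foa)

# With the true noise, the clean latent is recovered up to rounding error.
# A trained denoiser would supply an estimate of eps instead.
z0_hat = recover_z0(z_t, eps, alpha_bars[t])
```

In a full system, a trained network would predict the noise from `x`, the reverse process would iterate from pure noise down to a clean latent, and a VAE decoder would map that latent back to HOA data; none of those components are shown here.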

Taxonomy

Topics: Speech and Audio Processing · Hearing Loss and Rehabilitation · Music and Audio Processing