MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR

Tianyu Xu; Sieun Kim; Qianhui Zheng; Ruoyu Xu; Tejasvi Ravi; Anuva Kulkarni; Katrina Passarella-Ward; Junyi Zhu; Adarsh Kowdle

arXiv:2603.10465 · cs.SD · March 12, 2026

TL;DR

MoXaRt is a real-time XR system that leverages audio-visual cues to separate and interact with multiple sound sources, significantly improving scene awareness and social engagement in complex acoustic environments.

Contribution

We introduce MoXaRt, a cascaded architecture that combines audio and visual cues for real-time sound source separation in XR, handling up to five concurrent sources with roughly 2 seconds of processing latency.

Findings

01 Increased speech intelligibility by 36.2% in challenging environments.

02 Reduced cognitive load during complex sound interactions.

03 Validated with a dataset of 30 recordings and a 22-participant user study.

Abstract

In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly…
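
The cascade described in the abstract (coarse audio-only separation running in parallel with visual detection, followed by anchor-guided refinement) can be illustrated with a minimal Python sketch. This is not the authors' implementation: `run_cascade`, `VisualAnchor`, the `stem_index` matching field, and the three toy stage functions are hypothetical names introduced here, and the toy stages only pass numbers through where the real system would run neural networks.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

MAX_SOURCES = 5  # the abstract reports mixes of up to 5 concurrent sources


@dataclass
class VisualAnchor:
    label: str       # e.g. "face" or "instrument"
    stem_index: int  # ASSUMPTION: index of the coarse stem this anchor refines


def run_cascade(
    mixture: List[float],
    frame: Any,
    coarse_separate: Callable[[List[float]], List[List[float]]],
    detect_anchors: Callable[[Any], List[VisualAnchor]],
    refine: Callable[[List[float], VisualAnchor], List[float]],
) -> Dict[str, List[float]]:
    """Run the coarse audio stage and the visual stage in parallel, then refine."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        stems_future = pool.submit(coarse_separate, mixture)
        anchors_future = pool.submit(detect_anchors, frame)
        stems = stems_future.result()
        anchors = anchors_future.result()

    # One refined track per visual anchor, capped at the reported source limit.
    return {
        f"{anchor.label}_{i}": refine(stems[anchor.stem_index], anchor)
        for i, anchor in enumerate(anchors[:MAX_SOURCES])
    }


# Toy stand-ins for the three stages (hypothetical, for illustration only).
def toy_coarse(mixture: List[float]) -> List[List[float]]:
    half = [0.5 * x for x in mixture]
    return [half, list(half)]  # pretend stem 0 is speech, stem 1 is music


def toy_detect(frame: Any) -> List[VisualAnchor]:
    return [VisualAnchor("face", 0), VisualAnchor("instrument", 1)]


def toy_refine(stem: List[float], anchor: VisualAnchor) -> List[float]:
    return list(stem)  # a real refinement network would isolate one source


if __name__ == "__main__":
    tracks = run_cascade([1.0, 2.0], None, toy_coarse, toy_detect, toy_refine)
    print(sorted(tracks))
```

Submitting both first-stage calls to a thread pool mirrors the "in parallel" wording of the abstract; how the actual system schedules its stages and matches visual anchors to coarse stems is not stated in the text above.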

Taxonomy

Topics: Speech and Audio Processing · Multisensory perception and integration · Hearing Loss and Rehabilitation