MoXaRt: Audio-Visual Object-Guided Sound Interaction for XR
Tianyu Xu, Sieun Kim, Qianhui Zheng, Ruoyu Xu, Tejasvi Ravi, Anuva Kulkarni, Katrina Passarella-Ward, Junyi Zhu, Adarsh Kowdle

TL;DR
MoXaRt is a real-time XR system that leverages audio-visual cues to separate and interact with multiple sound sources, significantly improving scene awareness and social engagement in complex acoustic environments.
Contribution
We introduce MoXaRt, a cascaded architecture that combines audio and visual cues for real-time sound source separation in XR, handling up to five concurrent sources with roughly 2 seconds of processing latency.
Findings
Increased speech intelligibility by 36.2% in challenging environments.
Reduced cognitive load during complex sound interactions.
Validated through a technical evaluation on a new dataset of 30 one-minute recordings of concurrent speech and music, and through a 22-participant user study.
Abstract
In Extended Reality (XR), complex acoustic environments often overwhelm users, compromising both scene awareness and social engagement due to entangled sound sources. We introduce MoXaRt, a real-time XR system that uses audio-visual cues to separate these sources and enable fine-grained sound interaction. MoXaRt's core is a cascaded architecture that performs coarse, audio-only separation in parallel with visual detection of sources (e.g., faces, instruments). These visual anchors then guide refinement networks to isolate individual sources, separating complex mixes of up to 5 concurrent sources (e.g., 2 voices + 3 instruments) with ~2 second processing latency. We validate MoXaRt through a technical evaluation on a new dataset of 30 one-minute recordings featuring concurrent speech and music, and a 22-participant user study. Empirical results indicate that our system significantly…
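The cascade described in the abstract (coarse audio-only separation running in parallel with visual detection, followed by anchor-guided refinement) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: every name here (`coarse_separate`, `detect_anchors`, `refine`, `separate_scene`, `VisualAnchor`) is hypothetical, the split of the coarse stage into "speech" and "music" stems is an assumption, and the stand-in functions use trivial arithmetic where the real system would use neural networks.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Tuple

MAX_SOURCES = 5  # upper bound on concurrent sources reported in the abstract


@dataclass
class VisualAnchor:
    label: str  # hypothetical identifier, e.g. "face_0" or "guitar"
    kind: str   # assumed stem type: "speech" or "music"


def coarse_separate(mixture: List[float]) -> Dict[str, List[float]]:
    """Stand-in for the coarse, audio-only stage (assumed to yield two stems)."""
    half = [0.5 * x for x in mixture]
    return {"speech": half, "music": list(half)}


def detect_anchors(frame: List[Tuple[str, str]]) -> List[VisualAnchor]:
    """Stand-in for visual detection of sources (faces, instruments)."""
    return [VisualAnchor(label, kind) for label, kind in frame]


def refine(stem: List[float], anchor: VisualAnchor, n_sharing: int) -> List[float]:
    """Stand-in for a refinement network: split a stem evenly among its anchors."""
    return [x / n_sharing for x in stem]


def separate_scene(mixture: List[float],
                   frame: List[Tuple[str, str]]) -> Dict[str, List[float]]:
    # Stage 1: run coarse audio separation and visual detection in parallel.
    with ThreadPoolExecutor(max_workers=2) as pool:
        stems_future = pool.submit(coarse_separate, mixture)
        anchors_future = pool.submit(detect_anchors, frame)
        stems = stems_future.result()
        anchors = anchors_future.result()[:MAX_SOURCES]

    # Count how many anchors share each coarse stem.
    counts: Dict[str, int] = {}
    for anchor in anchors:
        counts[anchor.kind] = counts.get(anchor.kind, 0) + 1

    # Stage 2: each visual anchor guides refinement of its matching stem.
    return {a.label: refine(stems[a.kind], a, counts[a.kind]) for a in anchors}


if __name__ == "__main__":
    # Toy scene mirroring the abstract's example: 2 voices + 3 instruments.
    scene = [("face_0", "speech"), ("face_1", "speech"),
             ("guitar", "music"), ("piano", "music"), ("drums", "music")]
    tracks = separate_scene([1.0, 2.0], scene)
    print(sorted(tracks))
```

The point of the structure is that the two first-stage tasks do not depend on each other, so they can overlap in time, while the refinement stage needs both results before it can produce one track per detected source.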
Taxonomy
Topics: Speech and Audio Processing · Multisensory perception and integration · Hearing Loss and Rehabilitation
