MIMOSA: Human-AI Co-Creation of Computational Spatial Audio Effects on Videos
Zheng Ning, Zheng Zhang, Jerrick Ban, Kaiwen Jiang, Ruohong Gan, Yapeng Tian, Toby Jia-Jun Li

TL;DR
MIMOSA is a human-AI co-creation tool that simplifies the process of generating and editing immersive spatial audio effects in videos for amateur users, using an interpretable pipeline rather than black-box models.
Contribution
It introduces a novel, interpretable pipeline for spatial audio editing that enables amateur users to generate, validate, and customize audio effects collaboratively with AI.
Findings
A lab study showed high usability and usefulness.
Participants could create immersive spatial audio effects.
The tool supports creative customization of audio effects.
Abstract
Spatial audio offers more immersive video consumption experiences to viewers; however, creating and editing spatial audio is often expensive and requires specialized equipment and skills, posing a high barrier for amateur video creators. We present MIMOSA, a human-AI co-creation tool that enables amateur users to computationally generate and manipulate spatial audio effects. For a video with only monaural or stereo audio, MIMOSA automatically grounds each sound source to the corresponding sounding object in the visual scene and enables users to further validate and fix errors in the locations of sounding objects. Users can also augment the spatial audio effect by flexibly manipulating the sounding source positions and creatively customizing the audio effect. The design of MIMOSA exemplifies a human-AI collaboration approach that, instead of utilizing state-of-the-art end-to-end "black-box"…
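To make the idea of position-driven spatialization concrete, here is a minimal sketch of how a sounding object's horizontal location in the frame could be mapped to left/right gains for a mono source. This is not MIMOSA's pipeline: it uses standard constant-power stereo panning as a stand-in, and the function name `constant_power_pan` and the normalized coordinate `x_norm` are hypothetical names chosen for illustration.

```python
import math


def constant_power_pan(samples, x_norm):
    """Pan mono samples to stereo from a normalized horizontal position.

    Assumption (not from the paper): x_norm is the object's horizontal
    position in the video frame, 0.0 = far left, 1.0 = far right.
    Returns a list of (left, right) sample pairs.
    """
    if not 0.0 <= x_norm <= 1.0:
        raise ValueError("x_norm must be in [0, 1]")
    # Map position to an angle in [0, pi/2]; cos/sin gains keep
    # left^2 + right^2 constant, so perceived loudness stays steady.
    theta = x_norm * math.pi / 2
    left_gain = math.cos(theta)
    right_gain = math.sin(theta)
    return [(s * left_gain, s * right_gain) for s in samples]


# Example: a short mono signal placed at the center of the frame.
mono = [0.0, 0.5, 1.0, 0.5]
stereo = constant_power_pan(mono, 0.5)
```

Dragging the object to a new position would, under this sketch, simply mean calling the function again with a different `x_norm`; a full spatial audio renderer would also account for depth, elevation, and head-related filtering.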
