Multichannel-based learning for audio object extraction
Daniel Arteaga, Jordi Pons

TL;DR
This paper introduces a deep learning method for extracting audio objects from multichannel recordings. It addresses the scalability problem posed by complex audio productions and can be trained in either a supervised or an unsupervised way.
Contribution
It presents a novel deep learning framework that learns from multichannel renders, enabling scalable audio object extraction and defining new evaluation standards.
Findings
The method effectively handles dozens of simultaneous audio objects.
It outperforms baseline methods under certain conditions.
The approach supports both supervised and unsupervised learning modes.
Abstract
The current paradigm for creating and deploying immersive audio content is based on audio objects, each composed of an audio track and position metadata. While rendering an object-based production into a multichannel mix is straightforward, the reverse process involves sound source separation and estimating the spatial trajectories of the extracted sources. Moreover, cinematic object-based productions are often composed of dozens of simultaneous audio objects, which poses a scalability challenge for audio object extraction. Here, we propose a novel deep learning approach to object extraction that learns from the multichannel renders of object-based productions, instead of directly learning from the audio objects themselves. This approach allows tackling the object scalability challenge and also offers the possibility to formulate the problem in a supervised or an unsupervised…
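To make the idea of learning from renders concrete, here is a minimal sketch, not the authors' implementation. It assumes a stereo layout with constant-power panning and a mean squared error on the rendered channels; the function names `render_stereo` and `render_loss`, the azimuth convention, and the choice of loss are all hypothetical and are not taken from the paper.

```python
import numpy as np

def render_stereo(tracks, azimuths):
    """Render mono audio objects into a stereo mix (illustrative renderer only).

    tracks:   array (n_objects, n_samples), one mono track per object
    azimuths: array (n_objects, n_samples), pan position per sample in [-1, 1]
              (-1 = full left, +1 = full right); this convention is an assumption
    """
    theta = (azimuths + 1.0) * np.pi / 4.0  # map [-1, 1] to [0, pi/2]
    # Constant-power panning gains, shape (n_objects, 2, n_samples)
    gains = np.stack([np.cos(theta), np.sin(theta)], axis=1)
    # Apply the gains to each object and sum over objects: (2, n_samples)
    return np.sum(gains * tracks[:, None, :], axis=0)

def render_loss(est_tracks, est_azimuths, reference_mix):
    """Mean squared error between the render of estimated objects and a reference mix."""
    est_mix = render_stereo(est_tracks, est_azimuths)
    return float(np.mean((est_mix - reference_mix) ** 2))

# Usage: three objects moving across the stereo field
rng = np.random.default_rng(0)
tracks = rng.standard_normal((3, 1000))
azimuths = np.linspace(-1.0, 1.0, 1000)[None, :] * np.array([[1.0], [0.5], [-1.0]])
mix = render_stereo(tracks, azimuths)
print(mix.shape)
print(render_loss(tracks, azimuths, mix))
```

Because the loss in this sketch is computed on the rendered channels, its size depends on the number of output channels and not on the number of objects in the production, which is one way to read the scalability argument in the abstract.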
