Multichannel-to-Multichannel Target Sound Extraction Using Direction and Timestamp Clues
Dayun Choi, Jung-Woo Choi

TL;DR
This paper introduces a transformer-based multichannel sound extraction framework that uses spatial and temporal clues, namely direction-of-arrival (DoA) and source-activation timestamps, to extract a target sound from a multichannel audio mixture across different room environments.
Contribution
It presents a multichannel-to-multichannel (M2M) target sound extraction method driven by spatio-temporal clues, and demonstrates that a transformer architecture can use DoA clues without handcrafted spatial features.
Findings
Successfully extracts multichannel target signals using spatial and temporal clues.
Handles diverse room environments and sound classes effectively.
Eliminates the need for handcrafted spatial features in DoA-based extraction.
Abstract
We propose a multichannel-to-multichannel target sound extraction (M2M-TSE) framework for separating multichannel target signals from a multichannel mixture of sound sources. Target sound extraction (TSE) isolates a specific target signal using user-provided clues, typically focusing on single-channel extraction with class labels or temporal activation maps. However, to preserve and utilize spatial information in multichannel audio signals, it is essential to extract multichannel signals of a target sound source. Moreover, the clue for extraction can also include spatial or temporal cues like direction-of-arrival (DoA) or timestamps of source activation. To address these challenges, we present an M2M framework that extracts a multichannel sound signal based on spatio-temporal clues. We demonstrate that our transformer-based architecture can successfully accomplish the M2M-TSE task for…
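To make the two clue types concrete, here is a minimal sketch of how a DoA and a set of activation timestamps could be turned into numeric inputs. This is an illustration only: the abstract as shown here does not say how the clues are encoded, so the function names `encode_doa` and `encode_timestamps`, the (sin, cos) azimuth encoding, and the frame rate are all assumptions, not the authors' method.

```python
import math

def encode_doa(azimuth_deg):
    """Hypothetical encoding: map an azimuth to (sin, cos) so 0 and 360 degrees coincide."""
    rad = math.radians(azimuth_deg)
    return (math.sin(rad), math.cos(rad))

def encode_timestamps(segments, n_frames, frame_rate):
    """Hypothetical encoding: turn (onset_s, offset_s) pairs into a binary per-frame mask."""
    mask = [0] * n_frames
    for onset, offset in segments:
        # Round outward so any frame touched by the segment counts as active.
        start = max(0, math.floor(onset * frame_rate))
        end = min(n_frames, math.ceil(offset * frame_rate))
        for i in range(start, end):
            mask[i] = 1
    return mask

# Example: a target at 90 degrees azimuth, active from 0.5 s to 1.0 s,
# on an assumed grid of 20 frames at 10 frames per second.
doa_clue = encode_doa(90.0)
time_clue = encode_timestamps([(0.5, 1.0)], n_frames=20, frame_rate=10)
```

A cyclic encoding is one common way to avoid the discontinuity at the 0/360 degree boundary; a model could equally take the raw angle or a learned embedding.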
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
