Environmental Sound Extraction Using Onomatopoeic Words
Yuki Okamoto, Shota Horiguchi, Masaaki Yamamoto, Keisuke Imoto, Yohei Kawaguchi

TL;DR
This paper introduces a method for extracting a specific environmental sound from an audio mixture, using an onomatopoeic word to specify the target. A U-Net estimates a time-frequency mask from the mixture spectrogram and the word, and the mask is applied to the spectrogram to recover the target sound.
Contribution
The paper presents a new sound extraction technique that uses onomatopoeic words to specify targets, outperforming traditional sound-event class methods.
Findings
Effective extraction of target sounds using onomatopoeic words
Outperforms conventional sound-event class-based methods
Demonstrates the feasibility of linguistic cues in sound separation
Abstract
An onomatopoeic word, which is a character sequence that phonetically imitates a sound, is effective in expressing characteristics of sound such as duration, pitch, and timbre. We propose an environmental-sound-extraction method using onomatopoeic words to specify the target sound to be extracted. By this method, we estimate a time-frequency mask from an input mixture spectrogram and an onomatopoeic word using a U-Net architecture, then extract the corresponding target sound by masking the spectrogram. Experimental results indicate that the proposed method can extract only the target sound corresponding to the onomatopoeic word and performs better than conventional methods that use sound-event classes to specify the target sound.
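The final masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the spectrogram values are made up, the mask is written by hand as a stand-in for the U-Net output, and the helper name `apply_mask` and the clipping of the mask to [0, 1] are assumptions made for illustration.

```python
import numpy as np

def apply_mask(mixture_mag: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Element-wise masking of a magnitude spectrogram (frequency x time)."""
    if mixture_mag.shape != mask.shape:
        raise ValueError("mask and spectrogram must have the same shape")
    # Assumption: clip to [0, 1] so the mask can only attenuate, never amplify.
    return mixture_mag * np.clip(mask, 0.0, 1.0)

# Toy mixture: 3 frequency bins x 4 time frames (values are made up).
mixture = np.array([
    [1.0, 2.0, 0.5, 0.0],
    [0.2, 0.8, 1.5, 0.3],
    [0.0, 0.4, 0.6, 2.0],
])

# Hand-written stand-in for the network output: keep the first two
# time frames (where the hypothetical target sound occurs), drop the rest.
mask = np.array([
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])

extracted = apply_mask(mixture, mask)
print(extracted)
```

In the proposed method the mask would be estimated by the network from the mixture spectrogram and the onomatopoeic word. Here it is fixed by hand only to show how masking isolates part of the spectrogram.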
Taxonomy
Topics: Music and Audio Processing · Music Technology and Sound Studies · Speech and Audio Processing
Methods: Concatenated Skip Connection · Convolution · Max Pooling · U-Net
