Dual-path Transformer Based Neural Beamformer for Target Speech Extraction
Aoqi Guo, Sichong Qian, Baoxiang Li, Dazhi Gao

TL;DR
This paper proposes a dual-path transformer neural beamformer that enhances target speech extraction by combining time-domain cross-attention and frequency-domain self-attention, outperforming existing methods with fewer parameters.
Contribution
Introduces a neural beamformer built on a dual-path transformer that operates end-to-end, improving performance and reducing model complexity compared to prior approaches.
Findings
Outperforms current neural beamforming algorithms in speech separation
Reduces model parameter count significantly
Operates in a comprehensive end-to-end manner
Abstract
Neural beamformers, which integrate both pre-separation and beamforming modules, have demonstrated impressive effectiveness in target speech extraction. Nevertheless, the performance of these beamformers is inherently limited by the predictive accuracy of the pre-separation module. In this paper, we introduce a neural beamformer supported by a dual-path transformer. Initially, we employ the cross-attention mechanism in the time domain to extract crucial spatial information related to beamforming from the noisy covariance matrix. Subsequently, in the frequency domain, the self-attention mechanism is employed to enhance the model's ability to process frequency-specific details. By design, our model circumvents the influence of pre-separation modules, delivering performance in a more comprehensive end-to-end manner. Experimental results reveal that our model not only outperforms…
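The two attention paths described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the feature layout, the source of the queries, layer counts, or how beamforming weights are produced, so all of those are assumptions here. In particular, the function names (`noisy_covariance`, `dual_path_block`, `sketch_forward`), the use of the reference channel as the query, the per-frame outer-product covariance, and the random matrices standing in for learned projections are hypothetical choices made only to show the data flow.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention over the second-to-last axis."""
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def noisy_covariance(stft):
    """Per-bin, per-frame outer product x x^H; stft is (C, F, T) complex."""
    x = np.transpose(stft, (1, 2, 0))                    # (F, T, C)
    return x[..., :, None] * np.conj(x[..., None, :])    # (F, T, C, C)

def dual_path_block(query, cov_feats):
    """query, cov_feats: real arrays of shape (F, T, D)."""
    # Time path: cross-attention along T, independently for each frequency bin.
    h = query + attention(query, cov_feats, cov_feats)   # (F, T, D)
    # Frequency path: self-attention along F, independently for each frame.
    g = np.swapaxes(h, 0, 1)                             # (T, F, D)
    g = g + attention(g, g, g)
    return np.swapaxes(g, 0, 1)                          # (F, T, D)

def sketch_forward(stft, d_model=16, seed=0):
    """Map a multichannel STFT (C, F, T) to a single-channel estimate (F, T)."""
    rng = np.random.default_rng(seed)
    C, F, T = stft.shape
    cov = noisy_covariance(stft).reshape(F, T, C * C)
    cov_feats = np.concatenate([cov.real, cov.imag], axis=-1)  # (F, T, 2*C*C)
    # Assumption: the reference channel (index 0) supplies the queries.
    ref = np.stack([stft[0].real, stft[0].imag], axis=-1)      # (F, T, 2)
    # Random matrices stand in for learned projections (hypothetical).
    w_cov = rng.standard_normal((2 * C * C, d_model)) / np.sqrt(2 * C * C)
    w_ref = rng.standard_normal((2, d_model)) / np.sqrt(2)
    w_out = rng.standard_normal((d_model, 2 * C)) / np.sqrt(d_model)
    h = dual_path_block(ref @ w_ref, cov_feats @ w_cov)        # (F, T, D)
    out = h @ w_out                                            # (F, T, 2*C)
    weights = out[..., :C] + 1j * out[..., C:]                 # (F, T, C)
    x = np.transpose(stft, (1, 2, 0))                          # (F, T, C)
    # Filter-and-sum beamforming: y = w^H x for each bin and frame.
    return np.sum(np.conj(weights) * x, axis=-1)               # (F, T)

# Usage with a synthetic 4-channel STFT (33 frequency bins, 20 frames).
rng = np.random.default_rng(1)
stft = rng.standard_normal((4, 33, 20)) + 1j * rng.standard_normal((4, 33, 20))
y = sketch_forward(stft)
print(y.shape)
```

Because the weights are predicted directly from the noisy covariance features, the sketch has no mask-estimation or pre-separation stage, which is the property the abstract emphasizes. A trained model would replace the random projections with learned parameters and would likely stack several such blocks.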
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
