IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation
Kai Li, Runxuan Yang, Fuchun Sun, Xiaolin Hu

TL;DR
IIANet introduces a novel attention-based model for audio-visual speech separation, inspired by human brain mechanisms, achieving superior performance and efficiency on standard benchmarks.
Contribution
The paper proposes IIANet, a new model with intra- and inter-attention blocks for improved multi-scale audio-visual feature fusion in speech separation.
Findings
Outperforms previous state-of-the-art methods on LRS2, LRS3, and VoxCeleb2 benchmarks.
IIANet-fast is 40% faster than CTCNet on CPUs with better separation quality.
Achieves effective multimodal fusion using attention mechanisms inspired by human cognition.
Abstract
Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, they predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this issue, we propose a novel model called Intra- and Inter-Attention Network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by how the human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the…
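Illustrative sketch
The abstract describes attention-based fusion of audio and visual features at several temporal scales. The following is a minimal, generic sketch of that idea, not the authors' IntraA or InterA blocks: it assumes a sigmoid gate computed from visual features, average pooling to form the temporal scales, and nearest-neighbour resizing to align the two streams. All function names, the projection matrix `w`, and the scale factors are hypothetical and chosen for illustration only.

```python
import numpy as np


def avg_pool_time(x, factor):
    """Downsample a (channels, time) array by averaging non-overlapping windows."""
    c, t = x.shape
    t_trim = (t // factor) * factor  # drop trailing frames that do not fill a window
    return x[:, :t_trim].reshape(c, t_trim // factor, factor).mean(axis=2)


def resize_time(x, length):
    """Nearest-neighbour resize of a (channels, time) array to `length` frames."""
    idx = np.floor(np.arange(length) * x.shape[1] / length).astype(int)
    return x[:, idx]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def inter_modal_gate(audio, video, w):
    """Scale audio features by a gate in (0, 1) derived from video features.

    audio: (C_a, T_a), video: (C_v, T_v), w: (C_a, C_v) hypothetical projection.
    """
    v = resize_time(video, audio.shape[1])  # align video frames to the audio length
    gate = sigmoid(w @ v)                   # (C_a, T_a) attention-like weights
    return audio * gate


def multi_scale_fusion(audio, video, w, factors=(1, 2, 4)):
    """Apply the gate at several temporal scales and average the results."""
    fused = np.zeros_like(audio)
    for f in factors:
        a_s = avg_pool_time(audio, f)               # coarser view of the audio
        g = inter_modal_gate(a_s, video, w)         # video-guided gating at this scale
        fused += resize_time(g, audio.shape[1])     # back to the original resolution
    return fused / len(factors)


rng = np.random.default_rng(0)
audio = rng.standard_normal((8, 64))   # 8 audio channels, 64 frames (made-up sizes)
video = rng.standard_normal((4, 16))   # 4 visual channels, 16 frames (made-up sizes)
w = rng.standard_normal((8, 4))
fused = multi_scale_fusion(audio, video, w)
```

Because the gate lies in (0, 1), the fused features never exceed the audio features in magnitude; a learned model would train the projection and typically use richer attention than this fixed form.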
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis
Methods: Focus
