IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

Kai Li; Runxuan Yang; Fuchun Sun; Xiaolin Hu

arXiv:2308.08143 · cs.SD · February 5, 2024

TL;DR

IIANet is an attention-based model for audio-visual speech separation, inspired by how the human brain selectively attends to content. It outperforms previous state-of-the-art methods on LRS2, LRS3, and VoxCeleb2, and its IIANet-fast variant runs faster than CTCNet on CPUs.

Contribution

The paper proposes IIANet, a new model with intra- and inter-attention blocks for improved multi-scale audio-visual feature fusion in speech separation.

Findings

1. Outperforms previous state-of-the-art methods on LRS2, LRS3, and VoxCeleb2 benchmarks.
2. IIANet-fast is 40% faster than CTCNet on CPUs with better separation quality.
3. Achieves effective multimodal fusion using attention mechanisms inspired by human cognition.

Abstract

Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing methods predominantly focus on multi-modal fusion at a single temporal scale of auditory and visual features without employing selective attention mechanisms, which is in sharp contrast with the brain. To address this issue, we propose a novel model called Intra- and Inter-Attention Network (IIANet), which leverages the attention mechanism for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle and bottom of IIANet. Heavily inspired by how the human brain selectively focuses on relevant content at various temporal scales, these blocks maintain the ability to learn modality-specific features and enable the…
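To make the idea of attention-based audio-visual fusion at several temporal scales concrete, here is a minimal sketch in plain Python. It is not the authors' IntraA or InterA block, whose details are not given in the abstract above. It uses generic scaled dot-product cross-attention as a stand-in, and the names `avg_pool`, `cross_attention`, `fuse_multiscale`, and the `factors` parameter are hypothetical. It also assumes audio and visual features have already been projected to the same feature dimension.

```python
import math


def avg_pool(frames, factor):
    """Downsample a sequence of feature vectors by averaging groups of `factor` frames."""
    pooled = []
    for start in range(0, len(frames), factor):
        group = frames[start:start + factor]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Scaled dot-product attention: each query frame attends over all key/value frames."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, keys_values)) for d in range(dim)]
        )
    return out


def fuse_multiscale(audio, visual, factors=(1, 2, 4)):
    """Add visual context to audio frames at several temporal scales.

    Assumption: this residual, cross-attention fusion is an illustration only,
    not the fusion rule used in IIANet.
    """
    fused = [list(frame) for frame in audio]
    for factor in factors:
        pooled_audio = avg_pool(audio, factor)
        pooled_visual = avg_pool(visual, factor)
        # Audio frames act as queries; visual frames act as keys and values.
        context = cross_attention(pooled_audio, pooled_visual)
        # Upsample by nearest-neighbour repetition and add residually.
        for t in range(len(audio)):
            ctx = context[t // factor]
            fused[t] = [x + y for x, y in zip(fused[t], ctx)]
    return fused
```

Calling `fuse_multiscale(audio, visual)` with two lists of equal-width feature vectors returns one fused vector per audio frame. The audio and visual sequences may have different lengths, since attention aligns them implicitly.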

Code & Models

Repositories

JusperLee/IIANet (PyTorch, official)

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Speech Recognition and Synthesis

Methods: Focus