SwinLip: An Efficient Visual Speech Encoder for Lip Reading Using Swin Transformer

Young-Hu Park; Rae-Hong Park; Hyung-Min Park

arXiv:2505.04394 · cs.CV · May 9, 2025

TL;DR

This paper introduces SwinLip, a lightweight and efficient visual speech encoder based on Swin Transformer, which improves lip reading accuracy and speed while reducing computational complexity, outperforming existing models on multiple datasets.

Contribution

The paper proposes a novel Swin Transformer-based lip reading encoder, SwinLip, that enhances performance and efficiency over traditional CNN-based models, with state-of-the-art results on Mandarin LRW-1000.

Findings

01 SwinLip improves lip reading accuracy on LRW and LRW-1000 datasets.

02 SwinLip reduces computational load compared to previous models.

03 SwinLip achieves state-of-the-art performance on Mandarin LRW-1000.

Abstract

This paper presents an efficient visual speech encoder for lip reading. While most recent lip reading studies have been based on the ResNet architecture and have achieved significant success, they are not sufficiently suitable for efficiently capturing lip reading features due to high computational complexity in modeling spatio-temporal information. Additionally, using a complex visual model not only increases the complexity of lip reading models but also induces delays in the overall network for multi-modal studies (e.g., audio-visual speech recognition, speech enhancement, and speech separation). To overcome the limitations of Convolutional Neural Network (CNN)-based models, we apply the hierarchical structure and window self-attention of the Swin Transformer to lip reading. We configure a new lightweight scale of the Swin Transformer suitable for processing lip reading data and…
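
The window self-attention mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the SwinLip implementation: the feature-map size (8×8×16), the window size (4), and the helper names `window_partition`, `window_self_attention`, and `attention_pairs` are assumptions chosen for illustration, and the sketch omits learned projections, multiple heads, relative position bias, and shifted windows. It only shows why restricting attention to local windows reduces the number of query-key pairs compared with global attention.

```python
import numpy as np


def window_partition(x, window_size):
    """Split a (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes to the front, then flatten each window.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, window_size * window_size, C)


def window_self_attention(windows):
    """Single-head self-attention inside each window.

    Illustrative only: queries, keys, and values are the raw tokens
    (identity projections), which a real model would replace with
    learned linear layers.
    """
    d = windows.shape[-1]
    scores = windows @ windows.transpose(0, 2, 1) / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ windows


def attention_pairs(H, W, window_size=None):
    """Count query-key pairs for global attention or windowed attention."""
    tokens = H * W
    if window_size is None:
        return tokens * tokens
    num_windows = (H // window_size) * (W // window_size)
    return num_windows * (window_size ** 2) ** 2


# Hypothetical feature map: 8x8 spatial positions, 16 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 16))
wins = window_partition(feat, 4)
out = window_self_attention(wins)
```

With these assumed sizes, global attention over the 64 tokens compares 64 × 64 pairs, while attention within four 4×4 windows compares 4 × 16 × 16 pairs, a fourfold reduction. The saving grows with the feature-map size because windowed cost is linear in the number of tokens for a fixed window size.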

Taxonomy

Methods: Average Pooling · Global Average Pooling · Linear Layer · Convolution · Stochastic Depth · Kaiming Initialization · Multi-Head Attention · Dense Connections · Adam · Max Pooling