# MVIB-Lip: Multi-View Information Bottleneck for Visual Speech Recognition via Time Series Modeling

**Authors:** Yuzhe Li, Haocheng Sun, Jiayi Cai, Jin Wu

PMC · DOI: 10.3390/e27111121 · 2025-10-31

## TL;DR

This paper introduces MVIB-Lip, a visual speech recognition framework that fuses lip-landmark time series with recurrence plot images under a multi-view information bottleneck, targeting better accuracy and generalization in low-resource and speaker-independent settings.

## Contribution

The contribution is the integration of two complementary views of lip movement, landmark trajectories treated as multivariate time series and recurrence plot images derived from them, fused through a multi-view information bottleneck for lipreading.
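To make the second view concrete, here is a minimal sketch of how a recurrence plot can be computed from a multivariate trajectory. It is an illustration under stated assumptions, not the authors' pipeline: the function name `recurrence_plot`, the Euclidean distance, the threshold `eps`, and the toy circular trajectory are all hypothetical choices, since this summary does not specify how the paper builds its RPs.

```python
import numpy as np

def recurrence_plot(series, eps=None):
    """Pairwise-distance recurrence plot of a (T, D) trajectory.

    Returns a (T, T) matrix. With eps=None the raw distances are returned;
    otherwise entry (i, j) is 1 when frames i and j are within eps.
    """
    x = np.asarray(series, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D series as (T, 1)
    # Broadcast to (T, T, D) differences, then take the Euclidean norm over D.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    if eps is None:
        return dist
    return (dist <= eps).astype(np.uint8)

# Toy stand-in for landmark coordinates: one point moving around a circle.
t = np.linspace(0.0, 2.0 * np.pi, 50)
traj = np.stack([np.sin(t), np.cos(t)], axis=1)  # shape (50, 2)
rp = recurrence_plot(traj, eps=0.5)              # shape (50, 50), binary
```

The resulting matrix can be treated as a single-channel image, which is what makes it a natural input for an image backbone such as the ResNet-18 mentioned in the abstract.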

## Key findings

- MVIB-Lip consistently outperforms handcrafted baselines on OuluVS and a self-collected dataset.
- The framework improves generalization to speaker-independent recognition.
- The results suggest that recurrence plots, when combined with deep multi-view learning, offer a data-efficient route to robust visual speech recognition.

## Abstract

Lipreading, or visual speech recognition, is the task of interpreting utterances solely from visual cues of lip movements. While early approaches relied on Hidden Markov Models (HMMs) and handcrafted spatiotemporal descriptors, recent advances in deep learning have enabled end-to-end recognition using large-scale datasets. However, such methods often require millions of labeled or pretraining samples and struggle to generalize under low-resource or speaker-independent conditions. In this work, we revisit lipreading from a multi-view learning perspective. We introduce MVIB-Lip, a framework that integrates two complementary representations of lip movements: (i) raw landmark trajectories modeled as multivariate time series, and (ii) recurrence plot (RP) images that encode structural dynamics in a texture form. A Transformer encoder processes the temporal sequences, while a ResNet-18 extracts features from RPs; the two views are fused via a product-of-experts posterior regularized by the multi-view information bottleneck. Experiments on the OuluVS dataset and a self-collected dataset demonstrate that MVIB-Lip consistently outperforms handcrafted baselines and improves generalization to speaker-independent recognition. Our results suggest that recurrence plots, when coupled with deep multi-view learning, offer a principled and data-efficient path forward for robust visual speech recognition.
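The product-of-experts fusion mentioned in the abstract can be sketched for the common case of diagonal Gaussian experts, where the fused posterior has a closed form: precisions add, and the fused mean is the precision-weighted average of the expert means. The sketch below assumes that each view's encoder outputs a mean and log-variance, optionally includes a standard normal prior expert, and uses a KL term to that prior as a stand-in for a bottleneck regularizer. The names `poe_fuse` and `kl_to_standard_normal` are hypothetical, and the paper's exact objective is not given in this summary.

```python
import numpy as np

def poe_fuse(mus, logvars, include_prior=True):
    """Fuse diagonal Gaussian experts of shape (V, D) into one Gaussian.

    Returns the fused mean and log-variance, each of shape (D,).
    """
    mus = np.asarray(mus, dtype=float)
    precisions = np.exp(-np.asarray(logvars, dtype=float))  # 1 / variance
    weighted = (mus * precisions).sum(axis=0)
    total = precisions.sum(axis=0)
    if include_prior:
        total = total + 1.0  # N(0, I) prior expert: precision 1, mean 0
    var = 1.0 / total
    return var * weighted, np.log(var)

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over dimensions."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

# Toy latents standing in for the time-series view and the RP view.
mu_ts, mu_rp = np.array([1.0, 0.0]), np.array([3.0, 0.0])
unit_logvar = np.zeros(2)
mu, logvar = poe_fuse([mu_ts, mu_rp], [unit_logvar, unit_logvar],
                      include_prior=False)
penalty = kl_to_standard_normal(mu, logvar)
```

With two equally confident experts and no prior, the fused mean lands midway between the expert means and the fused variance is halved, which reflects the intuition that agreement across views sharpens the posterior.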

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12651204/full.md

---
Source: https://tomesphere.com/paper/PMC12651204