# Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio

**Authors:** Jeong Hun Yeo, Hyeongseop Rha, Sungjune Park, Junil Won, Yong Man Ro

arXiv: 2508.20476 · 2026-03-25

## TL;DR

This paper introduces a unified, modality-agnostic framework that generates spoken-language text from diverse combinations of sign language, lip movements, and audio, making communication technology more inclusive of people who are deaf or hard of hearing.

## Contribution

It presents what the authors describe as the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation within a single architecture.

## Key findings

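- Performs on par with or better than task-specific state-of-the-art models across Sign Language Translation (SLT), Visual Speech Recognition (VSR), Automatic Speech Recognition (ASR), and Audio-Visual Speech Recognition.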
- Explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.
- A single modality-agnostic architecture processes heterogeneous inputs, so sign language, lip movements, and audio can be combined rather than studied in isolation.

## Abstract

Audio is the primary modality for human communication and has driven the success of Automatic Speech Recognition (ASR) technologies. However, such audio-centric systems inherently exclude individuals who are deaf or hard of hearing. Visual alternatives such as sign language and lip reading offer effective substitutes, and recent advances in Sign Language Translation (SLT) and Visual Speech Recognition (VSR) have improved audio-less communication. Yet, these modalities have largely been studied in isolation, and their integration within a unified framework remains underexplored. In this paper, we propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or superior to state-of-the-art models specialized for individual tasks. Building on this framework, we achieve performance on par with or better than task-specific state-of-the-art models across SLT, VSR, ASR, and Audio-Visual Speech Recognition. Furthermore, our analysis reveals a key linguistic insight: explicitly modeling lip movements as a distinct modality significantly improves SLT performance by capturing critical non-manual cues.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.20476/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/2508.20476/full.md

## References

80 references — full list in the complete paper: https://tomesphere.com/paper/2508.20476/full.md

---
Source: https://tomesphere.com/paper/2508.20476