AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation

Jeongsoo Choi; Se Jin Park; Minsu Kim; Yong Man Ro

arXiv:2312.02512 · cs.CV · March 27, 2024 · 2 cites

TL;DR

This paper introduces AV2AV, a novel framework for direct audio-visual speech translation that enhances conversational realism and robustness by translating between audio-visual modalities using unified representations and self-supervised learning.

Contribution

The paper presents a unified audio-visual speech representation learning method and an AV-Renderer with zero-shot speaker modeling, enabling direct AV speech translation without a parallel audio-visual translation dataset.

Findings

01. Effective translation in noisy environments

02. Improved synchronization of lip movements and speech

03. Successful many-to-many language translation experiments

Abstract

This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework, in which both the input and the output of the system are multimodal (i.e., audio and visual speech). The proposed AV2AV offers two key advantages: 1) It enables lifelike conversations with individuals worldwide in a virtual meeting, with each participant using their own primary language. In contrast to Speech-to-Speech Translation (A2A), which translates only between audio modalities, AV2AV translates directly between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) It improves the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech, the system can effectively translate spoken language even in the presence…

Code & Models

Repositories

choijeongsoo/av2av
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Subtitles and Audiovisual Media · Video Analysis and Summarization