Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing

Amirkia Rafiei Oskooei; Eren Caglar; Ibrahim Sahin; Ayse Kayabay; Mehmet S. Aktas

arXiv:2512.13904·cs.MM·December 17, 2025

TL;DR

This paper introduces a scalable system architecture for real-time multilingual video translation using generative AI, addressing latency and computational challenges to enable practical multi-user video conferencing.

Contribution

The paper proposes a novel architecture that combines turn-taking and segmented processing to reduce computational complexity and latency, validated through a proof-of-concept implementation and a performance analysis across multiple hardware tiers.

Findings

01. Achieves real-time throughput ($\tau < 1.0$) on modern GPUs.

02. Reduces computational complexity from quadratic to linear in multi-user scenarios.

03. A user study confirms that users find the initial delay highly acceptable in exchange for a seamless experience.
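
The first and third findings can be illustrated with a small simulation. This is only a sketch under stated assumptions: the listing does not define $\tau$, so the code assumes it is a real-time factor (processing time per segment divided by segment duration), and it assumes a single worker that processes fixed-length segments in order. The names `real_time_factor` and `simulate_playback` are hypothetical and do not come from the paper.

```python
def real_time_factor(processing_s: float, segment_s: float) -> float:
    """Assumed definition of tau: processing time divided by segment duration."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    return processing_s / segment_s


def simulate_playback(num_segments: int, segment_s: float, processing_s: float):
    """Return (initial_delay, total_stall) in seconds for a live stream.

    Assumptions (illustrative, not from the paper): segment k is fully
    captured at (k + 1) * segment_s, one worker processes segments in order,
    and playback starts as soon as the first segment is ready.
    """
    ready_times = []
    worker_free_at = 0.0
    for k in range(num_segments):
        captured_at = (k + 1) * segment_s
        start = max(captured_at, worker_free_at)  # wait for capture and worker
        worker_free_at = start + processing_s
        ready_times.append(worker_free_at)

    initial_delay = ready_times[0]
    total_stall = 0.0
    play_at = initial_delay
    for ready in ready_times:
        if ready > play_at:  # segment not ready yet: playback stalls
            total_stall += ready - play_at
            play_at = ready
        play_at += segment_s  # play the segment at normal speed
    return initial_delay, total_stall


if __name__ == "__main__":
    for processing_s in (1.5, 3.0):
        tau = real_time_factor(processing_s, 2.0)
        delay, stall = simulate_playback(5, 2.0, processing_s)
        print(f"tau={tau:.2f} initial_delay={delay:.1f}s total_stall={stall:.1f}s")
```

Under these assumptions, a factor below 1.0 yields one fixed initial delay and no later stalls, while a factor above 1.0 makes stalls accumulate with every segment.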

Abstract

The real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. These include the cumulative latency of sequential model inference and the quadratic ($O(N^{2})$) computational complexity that renders multi-user video conferencing applications unscalable. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and a segmented processing protocol to manage inference latency for a perceptually real-time experience. We implement a proof-of-concept pipeline and conduct a rigorous performance analysis across a multi-tiered hardware setup, including commodity (NVIDIA RTX 4060), cloud…
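
The quadratic-to-linear claim can be made concrete by counting translation pipelines. This is a minimal sketch under an assumed counting model, not the paper's exact formulation: without turn-taking, each of the $N$ participants' streams is translated for each of the other $N - 1$ participants; with turn-taking, only the single active speaker's stream is translated for the $N - 1$ listeners. The function names are hypothetical.

```python
def naive_pipeline_count(n_participants: int) -> int:
    """Assumed all-to-all model: every stream is translated for every other
    participant, giving N * (N - 1) concurrent pipelines, i.e. O(N^2)."""
    if n_participants < 0:
        raise ValueError("n_participants must be non-negative")
    return n_participants * max(n_participants - 1, 0)


def turn_taking_pipeline_count(n_participants: int) -> int:
    """Assumed turn-taking model: only the active speaker's stream is
    translated, once per listener, giving N - 1 pipelines, i.e. O(N)."""
    if n_participants < 0:
        raise ValueError("n_participants must be non-negative")
    return max(n_participants - 1, 0)


if __name__ == "__main__":
    for n in (2, 4, 8, 16):
        print(n, naive_pipeline_count(n), turn_taking_pipeline_count(n))
```

Under this model, a call with 16 participants needs 240 concurrent pipelines in the all-to-all case and 15 with turn-taking.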

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Embedded Systems Design Techniques