Generative AI for Video Translation: A Scalable Architecture for Multilingual Video Conferencing
Amirkia Rafiei Oskooei, Eren Caglar, Ibrahim Sahin, Ayse Kayabay, Mehmet S. Aktas

TL;DR
This paper introduces a scalable system architecture for real-time multilingual video translation using generative AI, addressing latency and computational challenges to enable practical multi-user video conferencing.
Contribution
It proposes a novel architecture with turn-taking and segmented processing to reduce complexity and latency, validated through implementation and performance analysis on various hardware.
Findings
Achieves real-time throughput ($\tau < 1.0$) on modern GPUs.
Reduces computational complexity from quadratic to linear in multi-user scenarios.
User study confirms high acceptability of the initial delay for a seamless experience.
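As a rough illustration of the throughput criterion above: this summary does not define $\tau$, so the sketch below assumes it is the ratio of processing time to the duration of the media segment being processed, with $\tau < 1.0$ meaning the pipeline keeps up with playback. The function names and the timing values are hypothetical and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, segment_seconds: float) -> float:
    """Assumed definition: tau = processing time / duration of the segment."""
    if segment_seconds <= 0:
        raise ValueError("segment_seconds must be positive")
    return processing_seconds / segment_seconds


def keeps_up(processing_seconds: float, segment_seconds: float) -> bool:
    """True when the assumed tau is below 1.0, i.e. segments finish before they are due."""
    return real_time_factor(processing_seconds, segment_seconds) < 1.0


if __name__ == "__main__":
    # Made-up timings for illustration only; not measurements from the paper.
    for processing, segment in [(2.0, 4.0), (5.0, 4.0)]:
        tau = real_time_factor(processing, segment)
        print(f"processing={processing}s segment={segment}s tau={tau:.2f} "
              f"real-time={keeps_up(processing, segment)}")
```

Under this assumed definition, a pipeline with $\tau < 1.0$ can stay ahead of playback once the first segment has been processed, which is consistent with the summary's mention of an initial delay.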
Abstract
The real-time deployment of cascaded generative AI pipelines for applications like video translation is constrained by significant system-level challenges. These include the cumulative latency of sequential model inference and the quadratic ($O(N^2)$) computational complexity that renders multi-user video conferencing applications unscalable. This paper proposes and evaluates a practical system-level framework designed to mitigate these critical bottlenecks. The proposed architecture incorporates a turn-taking mechanism to reduce computational complexity from quadratic to linear in multi-user scenarios, and a segmented processing protocol to manage inference latency for a perceptually real-time experience. We implement a proof-of-concept pipeline and conduct a rigorous performance analysis across a multi-tiered hardware setup, including commodity (NVIDIA RTX 4060), cloud…
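The quadratic-to-linear claim can be illustrated with a simple count of translation streams. This is a minimal sketch under a stated assumption, not the authors' formulation: it assumes the worst case in which every listener needs a separately translated stream for every active speaker. The function names are hypothetical.

```python
def streams_all_speaking(n: int) -> int:
    """All n participants are translated at once: each speaker needs a
    translated stream for each of the other n - 1 listeners (assumed worst case)."""
    return n * (n - 1) if n > 1 else 0


def streams_turn_taking(n: int) -> int:
    """Only one participant is the active speaker at a time, so only the
    other n - 1 listeners need a translated stream."""
    return n - 1 if n > 1 else 0


if __name__ == "__main__":
    print("participants  all-speaking  turn-taking")
    for n in (2, 4, 8, 16):
        print(f"{n:>12}  {streams_all_speaking(n):>12}  {streams_turn_taking(n):>11}")
```

Under this assumption the number of concurrent streams grows as $N(N-1)$ without turn-taking and as $N-1$ with it, which matches the quadratic and linear orders named in the abstract.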
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Embedded Systems Design Techniques
