TL;DR
HighSync is a diffusion-based framework that generates high-resolution, photorealistic lip-synced video from arbitrary audio, addressing the trade-off between image quality and synchronization accuracy that limits earlier methods.
Contribution
It introduces a diffusion model for lip sync that operates natively at 512x512 resolution, and it identifies and eliminates a data leakage problem that undermined temporal modeling in prior methods.
Findings
Achieves state-of-the-art lip synchronization accuracy.
Operates at 512x512 resolution for professional-quality videos.
Effectively eliminates data leakage to improve temporal consistency.
Abstract
We present HighSync, an end-to-end diffusion-based framework for high-fidelity lip synchronization that generates photorealistic talking-face videos aligned with arbitrary input audio. Existing approaches consistently struggle to reconcile image quality with synchronization accuracy, producing either visually degraded outputs or temporally inconsistent lip movements. HighSync addresses both challenges simultaneously and, to our knowledge, is the first lip sync model to operate natively at 512x512 resolution, positioning it as a viable solution for professional production environments such as the film and broadcast industries. Central to our approach is the identification and systematic elimination of a data leakage phenomenon that has silently undermined temporal modeling in prior work, preventing models from developing a genuine dependence on the audio signal. Comprehensive evaluations…
