YODA: Yet Another One-step Diffusion-based Video Compressor

Xingchen Li; Junzhe Zhang; Junqi Shi; Ming Lu; Zhan Ma

arXiv:2601.01141·eess.IV·January 6, 2026

Open Access

TL;DR

YODA is a one-step diffusion-based video compressor that embeds multiscale features from temporal references and uses a linear Diffusion Transformer (DiT) for denoising. It reports state-of-the-art perceptual quality, outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID.

Contribution

The paper proposes YODA, a one-step diffusion model for video compression. It embeds multiscale features from temporal references into both latent generation and latent coding to better exploit spatial-temporal correlations, and it employs a linear Diffusion Transformer for efficient one-step denoising.

Findings

01

YODA outperforms traditional and deep-learning baselines on perceptual metrics.

02

YODA achieves state-of-the-art results on LPIPS, DISTS, FID, and KID.

03

The source code will be publicly available at https://github.com/NJUVISION/YODA.

Abstract

While one-step diffusion models have recently excelled in perceptual image compression, their application to video remains limited. Prior efforts typically rely on pretrained 2D autoencoders that generate per-frame latent representations independently, thereby neglecting temporal dependencies. We present YODA--Yet Another One-step Diffusion-based Video Compressor--which embeds multiscale features from temporal references for both latent generation and latent coding to better exploit spatial-temporal correlations for more compact representation, and employs a linear Diffusion Transformer (DiT) for efficient one-step denoising. YODA achieves state-of-the-art perceptual performance, consistently outperforming traditional and deep-learning baselines on LPIPS, DISTS, FID, and KID. Source code will be publicly available at https://github.com/NJUVISION/YODA.
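The abstract does not say which linear-attention variant the DiT uses, so the following is only a minimal sketch of the general idea behind linear attention, assuming a ReLU feature map (a common choice, not confirmed by the source). All function names here are hypothetical. The sketch shows why the cost grows linearly with the number of tokens: the key-value summary is computed once and reused for every query, and the result matches the quadratic formulation that builds every pairwise weight.

```python
import random

def relu(vec):
    """Element-wise ReLU, used here as the (assumed) attention feature map."""
    return [max(x, 0.0) for x in vec]

def linear_attention(queries, keys, values, eps=1e-6):
    """Non-causal linear attention: O(n * d * d_v) instead of O(n^2)."""
    d, d_v = len(keys[0]), len(values[0])
    kv = [[0.0] * d_v for _ in range(d)]   # sum over tokens of phi(k)^T v
    k_sum = [0.0] * d                      # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        pk = relu(k)
        for i in range(d):
            k_sum[i] += pk[i]
            for j in range(d_v):
                kv[i][j] += pk[i] * v[j]
    out = []
    for q in queries:
        pq = relu(q)
        denom = sum(pq[i] * k_sum[i] for i in range(d)) + eps
        out.append([sum(pq[i] * kv[i][j] for i in range(d)) / denom
                    for j in range(d_v)])
    return out

def quadratic_attention(queries, keys, values, eps=1e-6):
    """Reference: same kernel, but forms all n x n pairwise weights."""
    d_v = len(values[0])
    phi_keys = [relu(k) for k in keys]
    out = []
    for q in queries:
        pq = relu(q)
        weights = [sum(a * b for a, b in zip(pq, pk)) for pk in phi_keys]
        denom = sum(weights) + eps
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / denom
                    for j in range(d_v)])
    return out

# Toy tokens: 16 tokens with 4-dimensional queries, keys, and values.
random.seed(0)
n_tokens, dim = 16, 4
make = lambda: [[random.gauss(0.0, 1.0) for _ in range(dim)]
                for _ in range(n_tokens)]
q, k, v = make(), make(), make()

out_linear = linear_attention(q, k, v)
out_quadratic = quadratic_attention(q, k, v)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_linear, out_quadratic)
               for a, b in zip(row_a, row_b))
print(f"max abs difference: {max_diff:.2e}")
```

The two functions agree up to floating-point rounding because matrix multiplication is associative; only the order of operations, and therefore the cost, differs.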

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment · Advanced Data Compression Techniques