Low-Bitrate Video Compression through Semantic-Conditioned Diffusion

Lingdong Wang; Guan-Ming Su; Divya Kothandaraman; Tsung-Wei Huang; Mohammad Hajiesmaili; Ramesh K. Sitaraman

arXiv:2512.00408 · cs.CV · April 7, 2026

TL;DR

The paper introduces DiSCo, a semantic video compression method that transmits only compact semantic, appearance, and motion cues and relies on a generative model to reconstruct high-quality video at ultra-low bitrates, outperforming traditional codecs.

Contribution

DiSCo is a novel semantic compression framework that decomposes videos into semantic, appearance, and motion cues, enabling efficient transmission and high-quality reconstruction.

Findings

01. Outperforms baseline codecs by 2-10X on perceptual metrics at low bitrates.

02. Uses a conditional diffusion model for high-quality, temporally coherent video synthesis.

03. Employs multimodal representations including text, degraded video, and sketches or poses.

Abstract

Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that…
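The abstract names temporal forward filling without defining it. As a rough illustration only, the sketch below assumes one plausible reading: the sender keeps every few frames of the degraded video, and the receiver repeats the most recent kept frame so that the conditioning sequence matches the target length before it is passed to the diffusion model. The function names, the `stride` parameter, and the string stand-ins for frames are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")  # any per-frame payload; strings are used in the demo


def temporal_subsample(frames: Sequence[Frame], stride: int) -> List[Frame]:
    """Keep every `stride`-th frame (assumed sender-side temporal degradation)."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    return list(frames[::stride])


def forward_fill(kept: Sequence[Frame], stride: int, length: int) -> List[Frame]:
    """Rebuild a `length`-frame sequence by repeating the latest kept frame.

    Assumption: frame t is filled with the kept frame at index t // stride,
    clamped to the last kept frame.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    if length > 0 and not kept:
        raise ValueError("cannot fill from an empty sequence")
    last = len(kept) - 1
    return [kept[min(t // stride, last)] for t in range(length)]


if __name__ == "__main__":
    frames = [f"f{i}" for i in range(8)]      # stand-ins for decoded frames
    kept = temporal_subsample(frames, stride=3)
    filled = forward_fill(kept, stride=3, length=len(frames))
    print(kept)
    print(filled)
```

Under this reading, the filled sequence carries no extra bits over the subsampled one; it only aligns the appearance cue frame by frame with the output the generator is asked to produce.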
