When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding

Pingping Zhang; Jinlong Li; Kecheng Chen; Meng Wang; Long Xu; Haoliang Li; Nicu Sebe; Sam Kwong; Shiqi Wang

arXiv:2408.08093 · cs.CV · February 17, 2025

Open Access

TL;DR

This paper introduces a novel unified paradigm for video coding that leverages multimodal large language models to enhance semantic and perceptual quality, enabling flexible encoding-decoding modes and efficient frame interpolation.

Contribution

The paper pioneers the integration of multimodal large language models into video coding: it disentangles a video into spatial content and motion components, converts them into distinct modalities, and supports multiple reconstruction modes tuned to specific decoding requirements.

Findings

01. The Text-Text-to-Video (TT2V) mode achieves effective semantic reconstruction.

02. The Image-Text-to-Video (IT2V) mode exhibits competitive perceptual consistency.

03. The proposed frame interpolation model ensures smooth motion cues.

Abstract

Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including…

Taxonomy

Topics: Subtitles and Audiovisual Media