When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding
Pingping Zhang, Jinlong Li, Kecheng Chen, Meng Wang, Long Xu, Haoliang Li, Nicu Sebe, Sam Kwong, Shiqi Wang

TL;DR
This paper introduces a novel unified paradigm for video coding that leverages multimodal large language models to enhance semantic and perceptual quality, enabling flexible encoding-decoding modes and efficient frame interpolation.
Contribution
It pioneers the integration of multimodal large language models into video coding, disentangling a video into spatial content and motion components that are represented in distinct modalities, and enabling multiple reconstruction modes for improved quality.
Findings
The Text-Text-to-Video (TT2V) mode achieves effective semantic reconstruction.
The Image-Text-to-Video (IT2V) mode exhibits competitive perceptual consistency.
The proposed frame interpolation model produces smooth motion between frames.
Abstract
Existing codecs are designed to eliminate intrinsic redundancies to create a compact representation for compression. However, strong external priors from Multimodal Large Language Models (MLLMs) have not been explicitly explored in video compression. Herein, we introduce a unified paradigm for Cross-Modality Video Coding (CMVC), which is a pioneering approach to explore multimodality representation and video generative models in video coding. Specifically, on the encoder side, we disentangle a video into spatial content and motion components, which are subsequently transformed into distinct modalities to achieve very compact representation by leveraging MLLMs. During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes that optimize video reconstruction quality for specific decoding requirements, including…
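The encoder-side split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `EncodedClip`, `encode_clip`, `describe_content`, and `describe_motion` are hypothetical, the MLLM calls are stand-in callables supplied by the caller, and the mapping of modes to modalities (text for content in TT2V, a kept keyframe for content in IT2V, text for motion in both) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence


@dataclass
class EncodedClip:
    """Compact cross-modal representation of one clip (hypothetical layout)."""
    mode: str      # "TT2V" or "IT2V"
    content: Any   # text description (TT2V) or a kept keyframe (IT2V)
    motion: str    # text description of the motion


def encode_clip(
    frames: Sequence[Any],
    mode: str,
    describe_content: Callable[[Any], str],
    describe_motion: Callable[[Sequence[Any]], str],
) -> EncodedClip:
    """Split a clip into a content part and a motion part.

    describe_content and describe_motion stand in for MLLM calls;
    they are assumptions, not an API from the paper.
    """
    if not frames:
        raise ValueError("clip must contain at least one frame")
    keyframe = frames[0]
    if mode == "TT2V":
        # Assumed: content is carried as text only.
        content = describe_content(keyframe)
    elif mode == "IT2V":
        # Assumed: content is carried as the keyframe itself.
        content = keyframe
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return EncodedClip(mode=mode, content=content, motion=describe_motion(frames))
```

A decoder in this sketch would pass `content` and `motion` to a video generation model chosen by `mode`; that step is omitted here because the summary does not specify it.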
Taxonomy
Topics: Subtitles and Audiovisual Media
