TemCoCo: Temporally Consistent Multi-modal Video Fusion with Visual-Semantic Collaboration

Meiqi Gong; Hao Zhang; Xunpeng Yi; Linfeng Tang; Jiayi Ma

arXiv:2508.17817 · cs.CV · August 26, 2025

TL;DR

TemCoCo introduces a novel video fusion framework that explicitly models temporal dependencies and visual-semantic collaboration, significantly improving temporal consistency and fusion quality in multi-modal videos.

Contribution

TemCoCo is presented as the first video fusion framework to combine explicit temporal modeling with visual-semantic collaboration, targeting both temporal consistency and semantic accuracy.

Findings

01 Outperforms existing methods on public datasets.

02 Achieves higher temporal consistency scores.

03 Demonstrates improved visual and semantic fidelity.

Abstract

Existing multi-modal fusion methods typically apply static frame-based image fusion techniques directly to video fusion tasks, neglecting inherent temporal dependencies and leading to inconsistent results across frames. To address this limitation, we propose the first video fusion framework that explicitly incorporates temporal modeling with visual-semantic collaboration to simultaneously ensure visual fidelity, semantic accuracy, and temporal consistency. First, we introduce a visual-semantic interaction module consisting of a semantic branch and a visual branch, with Dinov2 and VGG19 employed for targeted distillation, allowing simultaneous enhancement of both the visual and semantic representations. Second, we are the first to integrate the video degradation enhancement task into the video fusion pipeline by constructing a temporal cooperative module, which leverages temporal dependencies to…
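To make the abstract's point about frame-wise fusion producing inconsistent results across frames more concrete, the following is a minimal sketch of a naive flicker proxy: the average absolute change between consecutive fused frames. It is an illustrative assumption, not the temporal consistency metric or loss used in the paper, and the names `mean_abs_diff` and `flicker_score` are hypothetical. It also ignores genuine scene motion, so it is only meaningful for comparing fusions of the same input sequence.

```python
from typing import Sequence

# One grayscale frame as rows of pixel intensities (hypothetical representation).
Frame = Sequence[Sequence[float]]


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Mean absolute pixel difference between two equally sized frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += abs(pa - pb)
            count += 1
    if count == 0:
        raise ValueError("frames must not be empty")
    return total / count


def flicker_score(frames: Sequence[Frame]) -> float:
    """Average frame-to-frame change; lower means a steadier video.

    Assumption: a crude proxy only, not the paper's metric. It does not
    compensate for real motion in the scene.
    """
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    diffs = [
        mean_abs_diff(frames[i], frames[i + 1])
        for i in range(len(frames) - 1)
    ]
    return sum(diffs) / len(diffs)


# A static scene fused consistently, versus the same scene with a
# brightness jump on the middle frame (as independent per-frame fusion might cause).
steady = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]
flickering = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[0.7, 0.7], [0.7, 0.7]],
    [[0.5, 0.5], [0.5, 0.5]],
]
```

Under this proxy, the `steady` sequence scores zero and the `flickering` sequence scores higher, which is the kind of frame-to-frame instability that explicit temporal modeling is meant to reduce.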
