FluencyVE: Marrying Temporal-Aware Mamba with Bypass Attention for Video Editing

Mingshu Cai; Yixuan Li; Osamu Yoshie; Yuya Ieiri

arXiv:2512.21015 · cs.CV · January 9, 2026

Open Access

TL;DR

FluencyVE introduces a novel, efficient one-shot video editing method that integrates temporal-aware modules into pretrained diffusion models, significantly reducing computational costs while maintaining high editing quality.

Contribution

The paper presents FluencyVE, an approach that replaces the temporal attention layer with Mamba and substitutes low-rank approximation matrices for the query and key weight matrices, enabling fast, high-quality video editing with reduced computational overhead.
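To make the low-rank idea concrete, here is a minimal sketch of a factorized projection. It assumes a generic factorization of a d x d weight into two thin factors. The names `down` and `up`, the sizes, and the random initialization are all hypothetical, and this is not necessarily the construction used in the paper.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def random_matrix(rows, cols, rng):
    """Return a rows x cols matrix of standard normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, r, tokens = 16, 2, 5            # hypothetical sizes, chosen for illustration

down = random_matrix(d, r, rng)    # d x r factor (assumed name)
up = random_matrix(r, d, rng)      # r x d factor (assumed name)
x = random_matrix(tokens, d, rng)  # token features, one row per token

# Project through the two thin factors without forming the d x d product.
q_factored = matmul(matmul(x, down), up)
# Same projection via the explicit rank-r d x d matrix, for comparison only.
q_dense = matmul(x, matmul(down, up))

dense_params = d * d               # parameters in a full d x d weight
factored_params = 2 * d * r        # parameters in the two factors
```

With these sizes the factored form stores 64 values instead of 256. The saving grows as the rank `r` shrinks relative to `d`, at the cost of restricting the projection to rank `r`.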

Findings

01. Effective editing of various video attributes, subjects, and locations.

02. Significant reduction in computational costs compared to existing methods.

03. Maintains strong generative quality in video editing tasks.

Abstract

Large-scale text-to-image diffusion models have achieved unprecedented success in image generation and editing. However, extending this success to video editing remains challenging. Recent video editing efforts have adapted pretrained text-to-image models by adding temporal attention mechanisms to handle video tasks. Unfortunately, these methods continue to suffer from temporal inconsistency issues and high computational overheads. In this study, we propose FluencyVE, which is a simple yet effective one-shot video editing approach. FluencyVE integrates the linear time-series module, Mamba, into a video editing model based on pretrained Stable Diffusion models, replacing the temporal attention layer. This enables global frame-level attention while reducing the computational costs. In addition, we employ low-rank approximation matrices to replace the query and key weight matrices in the…
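The cost argument for a linear time-series module can be illustrated with a plain linear recurrence over the frame axis. The sketch below is not Mamba itself: Mamba makes its state-space parameters input-dependent and uses a specialized scan, whereas this example assumes fixed scalar parameters (`decay`, `in_gain`, `out_gain`, all hypothetical) to show only the one-pass structure.

```python
def linear_scan(frames, decay, in_gain, out_gain):
    """Run h_t = decay * h_{t-1} + in_gain * x_t and y_t = out_gain * h_t per channel.

    `frames` is a list of per-frame feature vectors. The loop visits each frame
    once, so the cost grows linearly with the number of frames, while every
    output still depends on all earlier frames through the carried state.
    """
    channels = len(frames[0])
    state = [0.0] * channels
    outputs = []
    for x in frames:
        state = [decay * h + in_gain * v for h, v in zip(state, x)]
        outputs.append([out_gain * h for h in state])
    return outputs

# Three frames with two feature channels each (toy values).
frames = [[1.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
ys = linear_scan(frames, decay=0.5, in_gain=1.0, out_gain=1.0)
# ys == [[1.0, 0.0], [0.5, 0.0], [0.25, 2.0]]
```

The first channel shows the impulse from frame 0 decaying through later frames, which is how information propagates across time without a pairwise frame-to-frame attention matrix.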

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Video Analysis and Summarization