Shaping a Stabilized Video by Mitigating Unintended Changes for Concept-Augmented Video Editing
Mingce Guo, Jingxuan He, Shengeng Tang, Zhangye Wang, Lechao Cheng

TL;DR
This paper introduces a novel video editing method that improves stability and fidelity by using concept-augmented textual inversion and dual prior supervision, enabling more nuanced and stable text-driven video edits.
Contribution
It proposes a new framework combining concept-augmented textual inversion with dual prior supervision to enhance video editing stability and attribute control.
Findings
Produces more stable and lifelike videos
Outperforms state-of-the-art methods in stability and fidelity
Enables flexible, stylized video editing
Abstract
Text-driven video editing utilizing generative diffusion models has garnered significant attention due to its potential applications. However, existing approaches are constrained by the limited word embeddings provided in pre-training, which hinders nuanced editing targeting open concepts with specific attributes. Directly altering the keywords in target prompts often results in unintended disruptions to the attention mechanisms. To make editing more flexible, this work proposes an improved concept-augmented video editing approach that generates diverse and stable target videos by devising abstract conceptual pairs. Specifically, the framework involves concept-augmented textual inversion and a dual prior supervision mechanism. The former enables plug-and-play guidance of stable diffusion for video editing, effectively capturing target attributes for more stylized…
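The abstract's point that altering a prompt keyword can disrupt attention can be illustrated with a toy calculation. The sketch below is not the authors' method or code: it is a minimal scaled dot-product cross-attention over three prompt tokens, with made-up 4-dimensional embeddings and a hypothetical query vector. It shows only that, because softmax normalizes across all tokens, swapping one token's embedding also changes the weights assigned to the tokens that were left untouched.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention_weights(query, token_embeddings):
    """Attention of one query over prompt tokens (scaled dot product)."""
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, emb)) / scale
        for emb in token_embeddings
    ]
    return softmax(scores)

# Toy embeddings; every value here is an assumption chosen for illustration.
emb_a = [0.1, 0.0, 0.1, 0.0]
emb_man = [1.0, 0.2, 0.0, 0.1]
emb_robot = [0.2, 0.1, 1.0, 0.9]   # replacement keyword
emb_skiing = [0.0, 1.0, 0.3, 0.0]

# Hypothetical query for a pixel on the subject.
query = [0.9, 0.1, 0.0, 0.2]

source_weights = cross_attention_weights(query, [emb_a, emb_man, emb_skiing])
target_weights = cross_attention_weights(query, [emb_a, emb_robot, emb_skiing])

# The weight on "skiing" changes even though its embedding did not.
print("source:", [round(w, 3) for w in source_weights])
print("target:", [round(w, 3) for w in target_weights])
```

In this toy setting the shift on the unchanged tokens comes purely from softmax renormalization, which is one plausible reading of the disruption the abstract describes.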
Peer Reviews
Decision: Submitted to ICLR 2025
* Quantitative/qualitative evaluation shows relatively improved performance.
* The two proposed methods intuitively make sense.
* Writing needs improvement. It is difficult to distinguish what is being proposed and what are existing components in the method section.
* Lack of technical novelty. I think the use of LoRA with Textual Inversion (or other inversion/personalization technique) is already widely used as open-source, and compared to these, the use of LoRA in the proposed Concept-Augmented Textual Inversion does not appear to be significantly different. Also, although it is minor, it is not convincing why this te
1. The overall presentation of the paper is easy to follow. Most of the details are clear and well-documented.
2. Based on the quantitative metrics, the model can outperform the previous baselines.
1. My main concern is that the overall quality of the edited videos is not satisfactory. Mainly, there are obvious changes (shape and color) between frames, so the videos don’t look consistent. I think this may be the inherent weakness of the approach of inflating Stable Diffusion with temporal layers. The base model doesn’t have video prior knowledge compared to a video diffusion model which has been trained on video datasets. Possible ways to improve can be adding motion prior such as adopting
1. The motivation for introducing the concept of video editing is clear and good.
2. The results are impressive, the concepts injected are consistent, and the backgrounds are generally consistent with the target videos.
3. The paper reads well and is easy to follow.
1. How the authors obtained the results for the compared methods is not explained in the paper. Some of those methods don’t accept concept video/image input, and if they are not given the concept video, a direct comparison is a bit unfair.
2. The quantitative comparison table seems to be evaluated both with and without concept videos, and the ablation for using or not using concept videos is missing. I suggest the authors use separate datasets for using or not using concept videos, and this can further strengthe
Taxonomy
Topics: Multimedia Communication and Technology · Video Analysis and Summarization · Digital Rights Management and Security
Methods: Softmax · Attention Is All You Need · Diffusion
