TL;DR
VFXMaster introduces a unified, reference-based in-context learning framework for dynamic visual effect generation that generalizes well to unseen effects, overcoming resource limitations of previous methods.
Contribution
It is the first to recast VFX generation as an in-context learning task, enabling effective effect imitation and rapid one-shot adaptation with a single model.
Findings
Effective effect imitation across diverse categories
Strong generalization to unseen effects
Rapid one-shot effect adaptation
Abstract
Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
* The writing is fluent and logically coherent, exhibiting strong readability.
* The proposed method is highly efficient, requiring only a small number of model parameters to be fine-tuned in order to learn various VFX effects.
* Comprehensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.
* The proposed method lacks novelty. For the in-domain training part, it appears to be a straightforward extension of Custom Diffusion to the video domain. The out-of-domain part, on the other hand, resembles a modified version of prompt tuning.
* The ablation study appears somewhat coarse. According to Table 2, the Attention Mask has a significant impact on the performance of VFXMaster. Conducting a single, superficial ablation on the Attention Mask is insufficient. It would be helpful to adjus
- This paper presents a unified reference-based pipeline for visual effect video generation with an in-context attention mask. Compared to previous works, the motivation of this paper is clear. Instead of tuning one LoRA for each visual effect, this paper aims to handle all visual effects in a single framework, which is meaningful for this research topic.
- The visual and numerical results look great compared to previous works.
- In the introduction section, the authors say that most previous VFX generation methods are based on LoRA fine-tuning, and they list many references in the second and third paragraphs. However, in the experiment section, the authors do not appear to compare their approach with these works, so it is hard to say that the proposed approach performs better than these models.
- Presenting a unified model for VFX generation is interesting work. However, the technical contributions of th
1. Novel Problem Formulation: The most significant strength is the shift from specialized, closed-set VFX models to a unified, general-purpose imitation framework. By framing the task as in-context learning, the paper presents an elegant solution to the critical challenges of scalability and generalization that have limited prior work.
2. Effective Architectural Design: The in-context attention mask is a crucial and well-motivated component. The ablation study convincingly demonstrates its nece
1. Ambiguity and Potential Flaw in the VFX-Cons. Metric: The paper's new metric, VFX-Cons., is calculated as (EOS + EFS + CLS) / 3. However, the paper states, "CLS is only meaningful when EFS is True." The formula does not reflect this dependency. For example, if a video has the effect occur (EOS=True) but the fidelity is wrong (EFS=False), what is the value of CLS? If it is judged as True (no leakage), the score would be (1 + 0 + 1) / 3 = 0.67. If it is judged as False, the score is (1 + 0 + 0)
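The ambiguity raised in this review can be made concrete with a small sketch. It assumes each sub-score (EOS, EFS, CLS) is a per-video boolean, as the reviewer's example implies; the function name `vfx_cons` is hypothetical, and this is not the authors' evaluation code.

```python
def vfx_cons(eos: bool, efs: bool, cls: bool) -> float:
    """Unconditional average of the three sub-scores, as the formula is quoted in the review."""
    return (int(eos) + int(efs) + int(cls)) / 3


# The reviewer's case: the effect occurs (EOS=True) but fidelity fails (EFS=False).
# Per the quoted statement, CLS is not meaningful here, so both judgments are tried.
for cls_value in (True, False):
    score = vfx_cons(True, False, cls_value)
    print(f"EOS=True, EFS=False, CLS={cls_value}: {score:.2f}")
# prints:
# EOS=True, EFS=False, CLS=True: 0.67
# EOS=True, EFS=False, CLS=False: 0.33
```

The same video can score 0.67 or 0.33 depending only on how the undefined CLS value is filled in, which is the dependency the reviewer says the formula does not capture.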
1. The framework does not need one LoRA per effect, which improves the scalability of the model.
2. Strong empirical results are presented. It achieves better performance than competitors such as VFX Creator and Omni-Effects.
1. The major concern is the limited novelty of the proposed method. The proposed in-context conditioning for VFX generation is quite straightforward. Example-query in-context learning is already common in the generation field, and many works adopt a similar idea (e.g., IP-Adapter, PuLID). The in-context attention mask is also not new.
2. There are no comprehensive studies on the design of the attention mask. In the ablation, only the results with and without the attention mask are compared. However, more ablati
