VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning

Baolu Li; Yiming Zhang; Qinghe Wang; Liqian Ma; Xiaoyu Shi; Xintao Wang; Pengfei Wan; Zhenfei Yin; Yunzhi Zhuge; Huchuan Lu; Xu Jia

arXiv:2510.25772·cs.CV·October 30, 2025

4 Reviews

TL;DR

VFXMaster introduces a unified, reference-based in-context learning framework for dynamic visual effect generation that generalizes well to unseen effects, overcoming resource limitations of previous methods.

Contribution

It is the first to recast VFX generation as an in-context learning task, enabling effective effect imitation and rapid one-shot adaptation with a single model.

Findings

01·Effective effect imitation across diverse categories

02·Strong generalization to unseen effects

03·Rapid one-shot effect adaptation

Abstract

Visual effects (VFX) are crucial to the expressive power of digital media, yet their creation remains a major challenge for generative AI. Prevailing methods often rely on the one-LoRA-per-effect paradigm, which is resource-intensive and fundamentally incapable of generalizing to unseen effects, thus limiting scalability and creation. To address this challenge, we introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, it demonstrates remarkable generalization to unseen effect categories. Specifically, we design an in-context conditioning strategy that prompts the model with a reference example. An in-context attention mask is designed to precisely decouple and inject the essential…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01·Rating 4·Confidence 3

Strengths

* The writing is fluent and logically coherent, exhibiting strong readability.
* The proposed method is highly efficient, requiring only a small number of model parameters to be fine-tuned in order to learn various VFX effects.
* Comprehensive qualitative and quantitative experiments demonstrate the effectiveness of the proposed method.

Weaknesses

* The proposed method lacks novelty. For the in-domain training part, it appears to be a straightforward extension of Custom Diffusion to the video domain. The out-of-domain part, on the other hand, resembles a modified version of prompt tuning.
* The ablation study appears somewhat coarse. According to Table 2, the Attention Mask has a significant impact on the performance of VFXMaster. Conducting a single, superficial ablation on the Attention Mask is insufficient. It would be helpful to adjus

Reviewer 02·Rating 4·Confidence 4

Strengths

- This paper presents a unified reference-based pipeline for visual effect video generation with an in-context attention mask. Compared to previous works, the motivation of this paper is clear. Instead of tuning one LoRA for each visual effect, this paper aims to handle all visual effects in a single framework, which is meaningful for this research topic.
- The visual results and numerical results look great compared to previous works.

Weaknesses

- In the introduction section, the authors say that most previous VFX generation methods are based on LoRA fine-tuning, and they list many references in the second and third paragraphs. However, in the experiment section, it seems that the authors did not compare their approach with these works. It is therefore hard to say that the proposed approach performs better than these models.
- Presenting a unified model for VFX generation is an interesting work. However, the technical contributions of th

Reviewer 03·Rating 4·Confidence 4

Strengths

1. Novel Problem Formulation: The most significant strength is the shift from specialized, closed-set VFX models to a unified, general-purpose imitation framework. By framing the task as in-context learning, the paper presents an elegant solution to the critical challenges of scalability and generalization that have limited prior work.
2. Effective Architectural Design: The in-context attention mask is a crucial and well-motivated component. The ablation study convincingly demonstrates its nece

Weaknesses

1. Ambiguity and Potential Flaw in the VFX-Cons. Metric: The paper's new metric, VFX-Cons., is calculated as (EOS + EFS + CLS) / 3. However, the paper states, "CLS is only meaningful when EFS is True." The formula does not reflect this dependency. For example, if a video has the effect occur (EOS=True) but the fidelity is wrong (EFS=False), what is the value of CLS? If it is judged as True (no leakage), the score would be (1 + 0 + 1) / 3 = 0.67. If it is judged as False, the score is (1 + 0 + 0)
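The ambiguity raised in this weakness can be made concrete with a small sketch. It assumes EOS, EFS, and CLS are boolean judgments, as the reviewer's example suggests. `vfx_cons_literal` follows the quoted formula as written; `vfx_cons_gated` is a hypothetical variant, not taken from the paper, in which CLS only counts when EFS is True. Both function names are illustrative.

```python
def vfx_cons_literal(eos: bool, efs: bool, cls: bool) -> float:
    """Literal reading of the quoted formula: (EOS + EFS + CLS) / 3."""
    return (int(eos) + int(efs) + int(cls)) / 3


def vfx_cons_gated(eos: bool, efs: bool, cls: bool) -> float:
    """Hypothetical variant (an assumption, not from the paper):
    CLS contributes only when EFS is True."""
    effective_cls = cls and efs
    return (int(eos) + int(efs) + int(effective_cls)) / 3


if __name__ == "__main__":
    # Reviewer's scenario: the effect occurs (EOS=True) but fidelity is wrong (EFS=False).
    for cls in (True, False):
        literal = round(vfx_cons_literal(True, False, cls), 2)
        gated = round(vfx_cons_gated(True, False, cls), 2)
        print(f"CLS={cls}: literal={literal}, gated={gated}")
```

Under the literal formula, the same video scores differently depending on how CLS is judged when EFS is False; the gated variant removes that dependence, which is one possible way to make the formula match the stated condition.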

Reviewer 04·Rating 2·Confidence 5

Strengths

1. The framework does not need one LoRA per effect, increasing the scalability of the model.
2. Strong empirical results are presented. It achieves better performance compared to competitors like VFX Creator and Omni-Effects.

Weaknesses

1. The major concern is the limited novelty of the proposed method. The proposed in-context conditioning for VFX generation is quite straightforward. Example-query in-context learning is already common in the generation field; many works adopt a similar idea (e.g., IP-Adapter, PuLID). The in-context attention mask is also not new.
2. There is no comprehensive study of the attention mask design. The ablation compares only results with and without the attention mask. However, more ablati
