Many-for-Many: Unify the Training of Multiple Video and Image Generation and Manipulation Tasks
Ruibin Li, Tao Yang, Yangming Shi, Weiguo Feng, Shilei Wen, Bingyue Peng, Lei Zhang

TL;DR
This paper presents a unified many-for-many framework that trains a single model for multiple visual generation and manipulation tasks, leveraging diverse data to improve performance and reduce annotation costs.
Contribution
The authors introduce a lightweight adapter and joint learning strategy to unify training across various tasks, enabling a single model to perform over ten tasks with competitive results.
Findings
Model performs more than 10 tasks effectively.
8B model shows competitive video generation performance.
Joint training improves visual generation quality.
Abstract
Diffusion models have shown impressive performance in many visual generation and manipulation tasks. Many existing methods focus on training a model for a specific task, especially text-to-video (T2V) generation, while many other works focus on finetuning a pretrained T2V model for image-to-video (I2V), video-to-video (V2V), and image and video manipulation tasks. However, training a strong T2V foundation model requires a large amount of high-quality annotations, which is very costly. In addition, many existing models can perform only one or a few tasks. In this work, we introduce a unified framework, namely many-for-many, which leverages the available training data from many different visual generation and manipulation tasks to train a single model for those different tasks. Specifically, we design a lightweight adapter to unify the different conditions in different tasks, then…
Peer Reviews
Decision: ICLR 2026 Poster
1. Strong Evidence of Multi-Task Synergy: The paper's strongest point is Table 6. It clearly demonstrates the value of multi-task training. For example, the FLF2V and FLC2V tasks improve the "Dynamic" metric, while VINP and VOUTP boost semantic metrics. This shows the MfM framework is indeed learning a more robust and generalizable video representation. 2. SOTA Performance with a Single Model: Using the same 8B model, the method achieves the best "Average Rank" on both VBench-T2V and VBench-I2
1. Limited Novelty of Components: This is my main concern. While the MfM framework and its training results are novel, the architectural components are largely a combination of existing work. The model backbone is a DiT, the training technique is Flow Matching (RF), and the stabilization technique is QK-Norm, a combination very similar to recent work (e.g., SD3). The proposed "adapter" also appears to be just a few convolutional layers. This makes the paper feel more like an excellent enginee
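For readers unfamiliar with the Flow Matching (RF) technique named in the review above, the following is a minimal sketch of a rectified-flow training target in plain Python. It is a generic illustration, not the paper's implementation: the function names are hypothetical, and the convention that data sits at t = 0 and noise at t = 1 is an assumption (some formulations reverse it).

```python
def rectified_flow_pair(x0, noise, t):
    """Return the interpolated sample x_t and the velocity target.

    Assumed convention (not confirmed by the source): t = 0 is data,
    t = 1 is noise, and the path between them is a straight line.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]
    velocity = [b - a for a, b in zip(x0, noise)]  # constant along the path
    return x_t, velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


# Toy example: a two-element "latent" and a fixed noise vector.
x_t, v_target = rectified_flow_pair([1.0, 2.0], [3.0, 6.0], t=0.5)
# A model would be trained so that its output at (x_t, t) matches v_target.
loss_for_perfect_model = mse(v_target, v_target)
```

In an actual training loop the noise and the time t would be sampled randomly per example, and the prediction would come from the network rather than being copied from the target.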
1. The proposed MfM framework is both simple and elegant. The use of a unified adapter for diverse 3D conditions (including pixel data, depth, and masks), combined with task-name conditioning in the text prompt, represents an effective and scalable solution. This approach successfully unifies a wide range of visual generation and manipulation tasks within a single model, eliminating the need for task-specific fine-tuning. 2. The model demonstrates empirically strong performance, achieving the h
1. MfM is trained using proprietary data, but the authors do not clearly delineate the extent to which the model’s performance is attributable to the MfM framework itself versus the use of high-quality proprietary data. This ambiguity significantly limits the reference value of this work for the broader research community. 2. Regarding the composition of the training data, the authors mention that the sampling probability for T2I, T2V, and I2V tasks is three times higher than for other tasks. I
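The 3:1 sampling ratio mentioned in the review above can be sketched as weighted task sampling. This is a hedged illustration only: the helper names are hypothetical, the list of "other" tasks is just the subset of task names that appear on this page rather than the paper's full task list, and the source does not say how the sampler is actually implemented.

```python
import random
from collections import Counter

# Tasks the review says are sampled three times as often as the rest.
BOOSTED_TASKS = ["T2I", "T2V", "I2V"]
# Illustrative subset only (assumption): names taken from this page.
OTHER_TASKS = ["V2V", "FLF2V", "FLC2V", "VINP", "VOUTP"]


def build_task_probs(boosted, others, boost=3.0):
    """Give boosted tasks `boost` times the weight of the others, then normalize."""
    weights = {task: boost for task in boosted}
    weights.update({task: 1.0 for task in others})
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}


def sample_tasks(probs, n, seed=0):
    """Draw n task names according to probs, using a seeded generator."""
    rng = random.Random(seed)
    tasks = list(probs)
    return rng.choices(tasks, weights=[probs[t] for t in tasks], k=n)


probs = build_task_probs(BOOSTED_TASKS, OTHER_TASKS)
counts = Counter(sample_tasks(probs, 10_000))
```

With this task subset, each boosted task has probability 3/14 and each other task 1/14; the absolute values change with the number of tasks, but the 3:1 ratio does not.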
The paper aims to unify video generation and manipulation tasks. The adapter design is practical, supporting seamless conditioning across modalities. The multi-task joint training strategy is shown to improve both performance and data efficiency. Ablation studies clearly show the benefits of multi-task learning and depth conditioning.
1. The architectural novelty of the proposed framework is limited. While the paper presents a unified system, its backbone design largely follows existing diffusion transformer paradigms such as SD3. The main contribution lies in integrating known components, i.e., flow matching, 3D attention, and adapter-based conditioning, into a unified training pipeline, rather than introducing a new modeling framework. 2. Although the results are promising, the evaluation is based on relatively limited datase
Taxonomy
Topics: Surgical Simulation and Training
