Learning Video Generation for Robotic Manipulation with Collaborative Trajectory Control

Xiao Fu; Xintao Wang; Xian Liu; Jianhong Bai; Runsen Xu; Pengfei Wan; Di Zhang; Dahua Lin

arXiv:2506.01943·cs.CV·January 28, 2026

Open Access · 3 Reviews

TL;DR

This paper introduces RoboMaster, a novel framework for generating robotic manipulation videos by modeling inter-object dynamics through a collaborative trajectory approach, improving multi-object interaction fidelity.

Contribution

RoboMaster decomposes the interaction process into three sub-stages and models each with its dominant object, addressing the feature entanglement in overlapping regions that degrades multi-object manipulation video generation.

Findings

01

Achieves state-of-the-art results on Bridge, RLBench, and SIMPLER datasets.

02

Effectively models multi-object interactions with improved visual fidelity.

03

Enhances semantic consistency with appearance- and shape-aware representations.

Abstract

Recent advances in video diffusion models show promise for generating robotic decision-making data, with trajectory conditions further enabling fine-grained control. However, existing methods primarily focus on individual object motion and struggle to capture multi-object interaction crucial in complex manipulation. This limitation arises from entangled features in overlapping regions, leading to degraded visual fidelity. To address this, we present RoboMaster, a novel framework that models inter-object dynamics via a collaborative trajectory formulation. Unlike prior methods that decompose objects, our core is to decompose the interaction process into three sub-stages: pre-interaction, interaction, and post-interaction, and model each phase using the dominant object, specifically the robotic arm in the pre- and post-interaction phases and the manipulated object during interaction.…
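The three-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Phase` class, the `split_collaborative_trajectory` function, and the use of two frame indices to mark the interaction window are all assumptions made for illustration. It only shows how a single 2D trajectory might be split into pre-interaction, interaction, and post-interaction segments, each tagged with the dominant object named in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # one (x, y) pixel position per frame


@dataclass
class Phase:
    """One sub-stage of a collaborative trajectory (hypothetical container)."""
    name: str
    dominant: str          # which entity drives the motion in this phase
    points: List[Point]


def split_collaborative_trajectory(
    points: List[Point], start: int, end: int
) -> List[Phase]:
    """Split one trajectory into the three sub-stages named in the abstract.

    `start` is the index of the first interaction frame and `end` is the
    index of the first post-interaction frame. Both indices are assumed to
    be supplied by the user; how they are obtained is outside this sketch.
    """
    if not 0 <= start <= end <= len(points):
        raise ValueError("expected 0 <= start <= end <= len(points)")
    return [
        # The robotic arm dominates before it reaches the object.
        Phase("pre-interaction", "robotic arm", points[:start]),
        # The manipulated object dominates while it is being moved.
        Phase("interaction", "manipulated object", points[start:end]),
        # The robotic arm dominates again after releasing the object.
        Phase("post-interaction", "robotic arm", points[end:]),
    ]
```

For a five-frame trajectory with `start=2` and `end=4`, the sketch yields two arm-dominated frames, two object-dominated frames, and one final arm-dominated frame. Keeping a single trajectory and switching the dominant entity per phase mirrors the abstract's contrast with methods that assign a separate trajectory to each object.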

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

The paper is well-written and easy to understand. The method outperforms the baselines against which it is compared. The design choices are sensibly ablated. The work contributes a dataset of 21,000 human-annotated 2D robot manipulator trajectories. The work includes an honest discussion of its limitations.

Weaknesses

The proposed method operates purely in image space: the generated trajectories require postprocessing by an inverse kinematics model and are not guaranteed to be realistic or executable. Unlike its baselines, the method requires a segmentation of the provided trajectory into multiple stages by the user. The manual masking of the interacted object could be replaced by an automatic grounding and segmentation. A purely 2D trajectory input is very limiting, yet this is somewhat alleviated by the abi

Reviewer 02 · Rating 4 · Confidence 3

Strengths

1. The paper argues that interaction should originate from multiple entities, including the arm and the object; this is a novel viewpoint.
2. RoboMaster exhibits impressive OOD generalization.

Weaknesses

1. The motivation for decoupling the control signals is unclear; it is not explained how 2D trajectories help the robot learn, and the paper does not discuss the overall design in detail.
2. Section 4.5 is too brief, making it difficult to verify the method’s effectiveness for the robot; visual quality is not the core of the research. The core is whether the designed method can effectively aid robot learning.

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. Novel collaborative trajectory design. Introduces a new way to model robot–object interactions using a single collaborative trajectory split into pre-interaction, interaction, and post-interaction phases. Avoids the feature entanglement issues (e.g., missing or distorted objects) that plague prior methods like Tora and DragAnything.
2. High visual and physical realism. Produces smoother, more physically plausible manipulation videos with consistent object identities across frames. Quantita

Weaknesses

1. Restricted to 2D pixel space. The system does not yet model depth or 3D geometry; this limits physical accuracy and makes 3D control (e.g., precise grasping) difficult.
2. Possible failure on out-of-domain inputs. Can produce incomplete or distorted objects when encountering unseen categories or backgrounds. Still relies on training data diversity to generalize effectively.
3. Semantic dependency on user input. Relies on accurate prompts and roughly correct masks. Misleading text or poor m

Taxonomy

Topics: Robot Manipulation and Learning · Robotic Path Planning Algorithms · Robotic Mechanisms and Dynamics