Mitty: Diffusion-based Human-to-Robot Video Generation
Yiren Song, Cheng Liu, Weijia Mao, Mike Zheng Shou

TL;DR
Mitty is a diffusion transformer model that enables end-to-end human-to-robot video generation directly from demonstration videos, improving temporal and visual consistency without relying on intermediate representations.
Contribution
Mitty introduces a diffusion transformer architecture for human-to-robot video generation that leverages in-context learning and a novel data synthesis pipeline for training.
Findings
Achieves state-of-the-art results on Human2Robot and EPIC-Kitchens datasets.
Demonstrates strong generalization to unseen environments.
Provides new insights for scalable robot learning from human observations.
Abstract
Learning directly from human demonstration videos is a key milestone toward scalable and generalizable robot learning. Yet existing methods rely on intermediate representations such as keypoints or trajectories, introducing information loss and cumulative errors that harm temporal and visual consistency. We present Mitty, a Diffusion Transformer that enables video In-Context Learning for end-to-end Human2Robot video generation. Built on a pretrained video diffusion model, Mitty leverages strong visual-temporal priors to translate human demonstrations into robot-execution videos without action labels or intermediate abstractions. Demonstration videos are compressed into condition tokens and fused with robot denoising tokens through bidirectional attention during diffusion. To mitigate paired-data scarcity, we also develop an automatic synthesis pipeline that produces high-quality…
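The fusion step described in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it only shows the general idea of concatenating condition tokens with denoising tokens and running unmasked (bidirectional) self-attention over the joint sequence. The single head, identity projections, token counts, feature size, and the names `full_self_attention` and `fuse_in_context` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def full_self_attention(tokens):
    """Single-head self-attention with no mask.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections) to keep the sketch short. Because no mask is
    applied, every token attends to every other token.
    Returns (outputs, attention_weights).
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(w[j] * tokens[j][d] for j in range(len(tokens))) for d in range(dim)]
        )
    return outputs, weights


def fuse_in_context(human_tokens, robot_tokens):
    """Concatenate condition and denoising tokens, attend jointly,
    and keep only the robot positions (hypothetical design choice)."""
    sequence = human_tokens + robot_tokens
    out, weights = full_self_attention(sequence)
    return out[len(human_tokens):], weights


# Toy data: 4 condition tokens and 6 noisy robot tokens of size 8 (all made up).
rng = random.Random(0)
DIM = 8
human = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(4)]
robot = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(6)]

robot_out, attn = fuse_in_context(human, robot)
```

In this toy setup, `attn[i][j]` is the weight token `i` places on token `j` in the joined sequence, so nonzero weights in both the human-to-robot and robot-to-human blocks reflect the bidirectional exchange. A causal mask would zero out one of those blocks.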
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
Mitty addresses an important aspect of the data collection problem for robotics. Whereas Masquerade was a multi-step pipeline with learned components, Mitty shows that the human-to-robot translation can be learned end-to-end and that doing so can improve visual quality.
The paper explores the benefits of distilling human-robot paired data (where the robot is synthetically generated with a pipeline similar to Masquerade). While this is an interesting investigation, questions remain about the dataset quality and its effect on downstream real-robot tasks. A noted limitation is that Mitty only generates robot videos, not actions. While the paper presents improvements in visual quality with end-to-end training, no real robot experiments are
Conceptually, if a human demonstration could be easily translated into a robot demonstration, we could (in principle) obtain arbitrary training data for robotics.
In practice, this is a video-to-video translation task whose intermediate modules and domains are chosen to gear it towards robotics specifically. The tasks take the traditional overhead pick-and-place (PnP) format, but no evidence is provided that the resulting videos are useful to robotics. Minor: L240 has improper citation spacing.
1. Proposes a unified diffusion-based transformer that learns to map human demonstration videos to robotic executions without explicit action labels or intermediate representations, enabling end-to-end video generation.
2. Proposes a synthetic paired-dataset creation pipeline that automatically constructs high-quality human–robot video pairs by rendering robot arms into egocentric human videos, using hand segmentation, inpainting, and pose-mapping. This pipeline can leverage egocentric human video
1. The novelty of the Diffusion Transformer is somewhat incremental: it is largely built upon Wan 2.2 with modifications (LoRA tuning and bidirectional attention). While effective, it may not constitute a deep model innovation.
2. The synthetic paired-data generation pipeline introduces possible domain biases; errors accumulated at each step may lead to unrealistic or misaligned robot motion videos.
3. Evaluation focuses mainly on visual quality and human preference, lacking explicit quantitative task success me
This work makes good use of an existing pre-trained video model and proposes an effective conditioning mechanism using bidirectional attention for video generation. Dataset construction, experiment design, and baseline design are done well, and the resulting empirical performance is very strong. The paper is well written and tackles an important robot learning problem: solving paired human-robot video generation can have a great impact on the field, and the results shown in
The end-to-end nature of this work seems likely to lead to fewer compounding errors in video generation than a multistage approach like Masquerade. However, the human preference rating results seem a bit unfair to Masquerade because this work uses a human-in-the-loop filtering step while Masquerade does not. For the sake of improving this paper, it would be interesting to see the same set of experiments without the human-in-the-loop filtering step. But for designing a useful pipeline, I
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Robot Manipulation and Learning · Multimodal Machine Learning Applications
