EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation
Gehao Zhang, Zhenyang Ni, Payal Mohapatra, Han Liu, Ruohan Zhang, Qi Zhu

TL;DR
EmboAlign is a novel framework that leverages vision-language models to impose compositional constraints on video generative models, enabling more accurate and safe zero-shot robotic manipulation without task-specific training.
Contribution
The paper introduces a data-free method that aligns video generative models with constraints from vision-language models at inference time for improved robotic manipulation.
Findings
Improves success rate by 43.3 percentage points on real robot tasks.
Filters and refines video generative model (VGM) outputs using constraint-guided selection.
Enhances zero-shot manipulation accuracy without additional training data.
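As an illustration of what constraint-guided selection over VGM rollouts could look like, the following is a minimal sketch and not the authors' implementation. This summary does not specify how EmboAlign represents or scores constraints, so everything here is an assumption: each rollout is reduced to a hypothetical 3D keypoint trajectory, each constraint is a hypothetical cost function (0.0 means satisfied), hard constraints reject a rollout outright, and the surviving rollout with the lowest total cost is selected. The refinement step mentioned above is not covered.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical representation: one tracked keypoint's 3D path per rollout.
Point = Tuple[float, float, float]
Trajectory = List[Point]


@dataclass
class Constraint:
    """Illustrative constraint: cost(traj) == 0.0 means satisfied."""
    name: str
    cost: Callable[[Trajectory], float]
    hard: bool = False  # a violated hard constraint rejects the rollout


def total_cost(traj: Trajectory,
               constraints: Sequence[Constraint],
               tol: float = 1e-6) -> Optional[float]:
    """Sum constraint costs; return None if any hard constraint is violated."""
    total = 0.0
    for c in constraints:
        value = c.cost(traj)
        if c.hard and value > tol:
            return None
        total += value
    return total


def select_rollout(rollouts: Sequence[Trajectory],
                   constraints: Sequence[Constraint]) -> Optional[int]:
    """Return the index of the feasible rollout with the lowest cost, or None."""
    best_idx, best_cost = None, float("inf")
    for i, traj in enumerate(rollouts):
        cost = total_cost(traj, constraints)
        if cost is not None and cost < best_cost:
            best_idx, best_cost = i, cost
    return best_idx


# Toy constraints (assumed for illustration only).
GOAL: Point = (0.5, 0.0, 0.1)


def above_table(traj: Trajectory) -> float:
    # Deepest penetration below the table plane z = 0 (0.0 if never below).
    return max(0.0, max(-p[2] for p in traj))


def reach_goal(traj: Trajectory) -> float:
    # Distance from the final keypoint position to the goal.
    return math.dist(traj[-1], GOAL)


constraints = [
    Constraint("stay above table", above_table, hard=True),
    Constraint("end near goal", reach_goal),
]

rollouts = [
    [(0.0, 0.0, 0.1), (0.25, 0.0, -0.05), (0.5, 0.0, 0.1)],  # dips below table
    [(0.0, 0.0, 0.1), (0.2, 0.0, 0.2), (0.4, 0.0, 0.1)],     # ends far from goal
    [(0.0, 0.0, 0.1), (0.3, 0.0, 0.2), (0.45, 0.0, 0.1)],    # ends near goal
]

best = select_rollout(rollouts, constraints)
```

In this toy setup the first rollout reaches the goal exactly but is rejected because it passes below the table plane, which is the kind of physically implausible rollout that a safety constraint would be expected to filter out before any retargeting to robot actions.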
Abstract
Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign…
Taxonomy
Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications
