EmboAlign: Aligning Video Generation with Compositional Constraints for Zero-Shot Manipulation

Gehao Zhang; Zhenyang Ni; Payal Mohapatra; Han Liu; Ruohan Zhang; Qi Zhu

arXiv:2603.05757·cs.RO·March 9, 2026

TL;DR

EmboAlign uses vision-language models (VLMs) to impose compositional constraints on the outputs of video generative models (VGMs), enabling more accurate and safer zero-shot robotic manipulation without task-specific training.

Contribution

The paper introduces a data-free method that, at inference time, aligns the outputs of video generative models with constraints generated by vision-language models to improve robotic manipulation.

Findings

01. Improves success rate by 43.3 percentage points on real robot tasks.

02. Effectively filters and refines VGM outputs using constraint-guided selection.

03. Enhances zero-shot manipulation accuracy without additional training data.

Abstract

Video generative models (VGMs) pretrained on large-scale internet data can produce temporally coherent rollout videos that capture rich object dynamics, offering a compelling foundation for zero-shot robotic manipulation. However, VGMs often produce physically implausible rollouts, and converting their pixel-space motion into robot actions through geometric retargeting further introduces cumulative errors from imperfect depth estimation and keypoint tracking. To address these challenges, we present EmboAlign, a data-free framework that aligns VGM outputs with compositional constraints generated by vision-language models (VLMs) at inference time. The key insight is that VLMs offer a capability complementary to VGMs: structured spatial reasoning that can identify the physical constraints critical to the success and safety of manipulation execution. Given a language instruction, EmboAlign…
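To make the idea of constraint-guided selection concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each candidate rollout has already been reduced to a non-empty list of tracked 3D points and that each constraint can be written as a violation function that returns 0.0 when satisfied. The names `Constraint`, `stay_above`, `end_near`, `total_violation`, and `select_rollout`, as well as the `max_violation` threshold, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical representation: one tracked 3D point (x, y, z) per frame.
Point = Tuple[float, float, float]
Trajectory = List[Point]


@dataclass
class Constraint:
    """A named violation function; 0.0 means the constraint is satisfied."""
    name: str
    violation: Callable[[Trajectory], float]


def stay_above(table_z: float) -> Constraint:
    """Penalize the deepest dip of the trajectory below height table_z."""
    def violation(traj: Trajectory) -> float:
        return max(0.0, max(table_z - p[2] for p in traj))
    return Constraint("stay_above", violation)


def end_near(goal: Point, radius: float) -> Constraint:
    """Penalize the distance by which the final point misses the goal ball."""
    def violation(traj: Trajectory) -> float:
        return max(0.0, math.dist(traj[-1], goal) - radius)
    return Constraint("end_near", violation)


def total_violation(traj: Trajectory, constraints: Sequence[Constraint]) -> float:
    """Compose constraints by summing their individual violations."""
    return sum(c.violation(traj) for c in constraints)


def select_rollout(
    candidates: Sequence[Trajectory],
    constraints: Sequence[Constraint],
    max_violation: float = 1e-3,
) -> Optional[int]:
    """Return the index of the least-violating candidate, or None if even
    the best candidate exceeds max_violation (all rollouts are rejected)."""
    if not candidates:
        return None
    scores = [total_violation(t, constraints) for t in candidates]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] <= max_violation else None


# Toy data (made up): move a point from x=0 to x=1 without going below z=0.
constraints = [stay_above(0.0), end_near((1.0, 0.0, 0.1), radius=0.02)]
dips_below = [(0.0, 0.0, 0.1), (0.5, 0.0, -0.05), (1.0, 0.0, 0.1)]
stops_short = [(0.0, 0.0, 0.1), (0.2, 0.0, 0.1)]
feasible = [(0.0, 0.0, 0.1), (0.5, 0.0, 0.1), (1.0, 0.0, 0.1)]
chosen = select_rollout([dips_below, stops_short, feasible], constraints)
```

Summing violations is only one way to compose constraints; a weighted sum or a hard per-constraint filter would fit the same interface. Returning `None` when every candidate fails lets a caller request new rollouts instead of executing an implausible one.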

Taxonomy

Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications