Action Agent: Agentic Video Generation Meets Flow-Constrained Diffusion

Jeffrin Sam; Nguyen Khang; Yara Mahmoud; Miguel Altamirano Cabrera; Dzmitry Tsetserukou

arXiv:2605.01477·cs.RO·May 5, 2026

TL;DR

Action Agent is a two-stage framework that combines language-guided video synthesis with flow-constrained diffusion control to improve multi-embodiment robot navigation success in simulation and real-world environments.

Contribution

The paper unifies agentic navigation video generation with flow-constrained diffusion control in one two-stage pipeline and reports high success rates across multiple robot embodiments.

Findings

01. Video generation success increased from 35% (single-shot) to 86% across 50 navigation tasks.

02. Achieved 73.2% success in simulation and 64.7% in real-world indoor navigation.

03. Runs at 40 to 47 Hz with a 43M-parameter model.

Abstract

We present Action Agent, a two-stage framework that unifies agentic navigation video generation with flow-constrained diffusion control for multi-embodiment robot navigation. In Stage I, a large language model (LLM) acts as an orchestration module that selects video diffusion models, refines prompts through iterative validation, and accumulates cross-task memory to synthesize physically plausible first-person navigation videos from language and image inputs. This increases video generation success from 35% (single-shot) to 86% across 50 navigation tasks. In Stage II, we introduce FlowDiT, a Flow-Constrained Diffusion Transformer that converts optimized goal videos and language instructions into continuous velocity commands using action-space denoising diffusion. FlowDiT integrates DINOv2 visual features, learned optical flow for ego-motion representation, and CLIP language embeddings…
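The Stage I loop described in the abstract (select a video model, generate, validate, refine the prompt, and carry memory across tasks) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `orchestrate`, `Memory`, `make_toy_generator`, and `toy_validator` are hypothetical, model selection is a simple round-robin, and prompt refinement just appends validator feedback. In the paper these decisions are made by an LLM, and the abstract does not specify the actual interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: the abstract does not define an API.
Generator = Callable[[str], str]               # prompt -> video identifier
Validator = Callable[[str], Tuple[bool, str]]  # video -> (ok, feedback)


@dataclass
class Memory:
    """Cross-task memory: prompt hints that preceded a validated video."""
    hints: List[str] = field(default_factory=list)


def orchestrate(task: str,
                generators: Dict[str, Generator],
                validate: Validator,
                memory: Memory,
                max_rounds: int = 3) -> Optional[Tuple[str, int]]:
    """Return (video, rounds_used) on success, or None if every round fails."""
    names = sorted(generators)
    # Seed the prompt with hints accumulated from earlier tasks.
    prompt = " ".join([task] + memory.hints)
    used_feedback: List[str] = []
    for round_idx in range(max_rounds):
        # Stand-in for LLM model selection: round-robin over candidates.
        name = names[round_idx % len(names)]
        video = generators[name](prompt)
        ok, feedback = validate(video)
        if ok:
            # Keep only feedback that ended in a validated video.
            for hint in used_feedback:
                if hint not in memory.hints:
                    memory.hints.append(hint)
            return video, round_idx + 1
        # Stand-in for LLM prompt refinement: append the validator feedback.
        used_feedback.append(feedback)
        prompt = f"{prompt} {feedback}"
    return None


def make_toy_generator(name: str) -> Generator:
    """Toy 'video model' that just echoes its name and prompt."""
    return lambda prompt: f"{name}:{prompt}"


def toy_validator(video: str) -> Tuple[bool, str]:
    """Toy check: accept only videos whose prompt asked for first-person view."""
    if "first-person" in video:
        return True, ""
    return False, "first-person view"
```

With the toy helpers, a first task needs one refinement round before it validates, and a second task reuses the stored hint and passes on its first attempt. This is the kind of effect cross-task memory is meant to provide, though the toy setup says nothing about the 35% to 86% improvement reported above.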
