TL;DR
LangDriveCTRL is a framework for editing real-world driving videos from natural-language instructions. It represents each scene as an explicit 3D scene graph and coordinates specialized agents to make precise, realistic modifications.
Contribution
It introduces a feedback-driven multi-agent pipeline, built on an explicit 3D scene graph, for fine-grained, photorealistic editing of driving scenes from natural-language instructions.
Findings
Achieves nearly 2x higher instruction alignment than previous state-of-the-art methods.
Supports object removal, insertion, replacement, and multi-object behavior editing.
Produces more photorealistic and structurally consistent edited videos.
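To make the supported edit types concrete, here is a minimal sketch of how removal, insertion, replacement, and behavior editing could be expressed as operations on a scene graph. The class names (`ObjectNode`, `SceneGraph`), their fields, and the trajectory format are all assumptions chosen for illustration; the paper's actual data structures are not described on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Hypothetical per-frame pose: (x, y, heading). Not taken from the paper.
Pose = Tuple[float, float, float]


@dataclass
class ObjectNode:
    """A dynamic object in the scene (illustrative fields only)."""
    node_id: str
    category: str
    trajectory: List[Pose]


@dataclass
class SceneGraph:
    """Static background plus dynamic object nodes (illustrative only)."""
    background: str  # placeholder handle for the static background
    objects: Dict[str, ObjectNode] = field(default_factory=dict)

    def insert(self, node: ObjectNode) -> None:
        self.objects[node.node_id] = node

    def remove(self, node_id: str) -> ObjectNode:
        return self.objects.pop(node_id)

    def replace(self, node_id: str, new_category: str) -> None:
        # Swap the object's category while keeping its original motion.
        old = self.objects[node_id]
        self.objects[node_id] = ObjectNode(node_id, new_category, list(old.trajectory))

    def edit_behavior(self, node_id: str, new_trajectory: List[Pose]) -> None:
        self.objects[node_id].trajectory = list(new_trajectory)


graph = SceneGraph(background="static_background")
graph.insert(ObjectNode("car_1", "sedan", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))
graph.insert(ObjectNode("ped_1", "pedestrian", [(5.0, 2.0, 1.57)]))
graph.replace("car_1", "truck")   # replacement
graph.remove("ped_1")             # removal
graph.edit_behavior("car_1", [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)])  # behavior edit
```

In this sketch every edit only touches object nodes, so the static background is left unchanged; rendering and harmonization would happen in a separate step.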
Abstract
LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It represents each video as an explicit 3D scene graph, decomposing the scene into a static background and dynamic object nodes. To enable fine-grained editing and realism, it introduces a feedback-driven agentic pipeline. An Orchestrator converts user instructions into executable graphs that coordinate specialized multi-modal agents and tools. An Object Grounding Agent aligns free-form text with target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and harmonized using a video diffusion tool, and then further refined by a Video Reviewer Agent to ensure…
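The iterative review-and-refine pattern described for the Behavior Editing Agent and Behavior Reviewer Agent can be sketched as a generic propose/review loop. This is only an illustration under stated assumptions: `refine_with_feedback`, `make_toy_editor`, and `toy_reviewer` are hypothetical names, and the toy editor and reviewer below are simple rule-based stand-ins for what the abstract describes as multi-modal agents.

```python
import math
from typing import Callable, List, Optional, Tuple

Trajectory = List[Tuple[float, float]]  # hypothetical per-frame (x, y) positions


def refine_with_feedback(
    propose: Callable[[Optional[str]], Trajectory],
    review: Callable[[Trajectory], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Trajectory, bool]:
    """Alternate proposing and reviewing until accepted or out of rounds."""
    feedback: Optional[str] = None
    candidate: Trajectory = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        feedback = review(candidate)
        if feedback is None:  # reviewer has no complaints
            return candidate, True
    return candidate, False


def make_toy_editor(step: float) -> Callable[[Optional[str]], Trajectory]:
    """Stand-in editor: halves its per-frame step whenever it gets feedback."""
    state = {"step": step}

    def propose(feedback: Optional[str]) -> Trajectory:
        if feedback is not None:
            state["step"] /= 2.0
        return [(i * state["step"], 0.0) for i in range(5)]

    return propose


def toy_reviewer(traj: Trajectory, max_step: float = 2.0) -> Optional[str]:
    """Stand-in reviewer: rejects implausibly large per-frame displacement."""
    for (x0, y0), (x1, y1) in zip(traj, traj[1:]):
        if math.hypot(x1 - x0, y1 - y0) > max_step:
            return "per-frame displacement too large"
    return None


trajectory, accepted = refine_with_feedback(
    make_toy_editor(step=6.0), toy_reviewer, max_rounds=5
)
```

Here the first two proposals (steps of 6.0 and 3.0) are rejected and the third (step 1.5) is accepted. Capping the loop with `max_rounds` is a design choice of this sketch, so that a reviewer that never accepts cannot stall the pipeline.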
