LangDriveCTRL: Natural Language Controllable Driving Scene Editing with Multi-modal Agents

Yun He; Francesco Pittaluga; Ziyu Jiang; Matthias Zwicker; Manmohan Chandraker; Zaid Tasneem

arXiv:2512.17445 · cs.CV · April 10, 2026

TL;DR

LangDriveCTRL is a framework for editing driving videos from natural language instructions. It represents each scene as an explicit 3D scene graph and uses a pipeline of multi-modal agents to make precise, realistic modifications.

Contribution

LangDriveCTRL introduces a multi-agent pipeline that combines a scene graph representation with feedback mechanisms, enabling fine-grained, photorealistic scene editing from natural language instructions.

Findings

01. Achieves nearly 2x higher instruction alignment than previous state-of-the-art methods.

02. Supports object removal, insertion, replacement, and multi-object behavior editing.

03. Produces more photorealistic and structurally consistent edited videos.
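The edit types listed above can be pictured as operations on a scene graph that holds a static background and a set of dynamic object nodes. The following is a minimal sketch under that assumption, not the authors' implementation: the class names `SceneGraph` and `ObjectNode`, the field names, and the per-frame `(x, y, z)` trajectory format are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]  # hypothetical per-frame (x, y, z) position


@dataclass
class ObjectNode:
    """Hypothetical dynamic object node: a category and one position per frame."""
    node_id: str
    category: str
    trajectory: List[Point]


@dataclass
class SceneGraph:
    """Hypothetical scene graph: a static background plus dynamic object nodes."""
    background: str  # placeholder handle for the static background
    objects: Dict[str, ObjectNode] = field(default_factory=dict)

    def insert(self, node: ObjectNode) -> None:
        # Object insertion: add a new dynamic node.
        self.objects[node.node_id] = node

    def remove(self, node_id: str) -> ObjectNode:
        # Object removal: drop the node and return it.
        return self.objects.pop(node_id)

    def replace(self, node_id: str, new_category: str) -> None:
        # Object replacement: keep the trajectory, swap the object's category.
        self.objects[node_id].category = new_category

    def set_trajectory(self, node_id: str, trajectory: List[Point]) -> None:
        # Behavior editing: overwrite the node's motion.
        self.objects[node_id].trajectory = list(trajectory)


graph = SceneGraph(background="static_background")
graph.insert(ObjectNode("car_1", "sedan", [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]))
graph.insert(ObjectNode("ped_1", "pedestrian", [(5.0, 2.0, 0.0), (5.0, 1.5, 0.0)]))

graph.replace("car_1", "truck")
removed = graph.remove("ped_1")
graph.set_trajectory("car_1", [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)])
```

Keeping the background separate from the object nodes means each edit touches only the nodes it names, which is one plausible reason an explicit graph suits fine-grained editing.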

Abstract

LangDriveCTRL is a natural-language-controllable framework for editing real-world driving videos to synthesize diverse traffic scenarios. It represents each video as an explicit 3D scene graph, decomposing the scene into a static background and dynamic object nodes. To enable fine-grained editing and realism, it introduces a feedback-driven agentic pipeline. An Orchestrator converts user instructions into executable graphs that coordinate specialized multi-modal agents and tools. An Object Grounding Agent aligns free-form text with target object nodes in the scene graph; a Behavior Editing Agent generates multi-object trajectories from language instructions; and a Behavior Reviewer Agent iteratively reviews and refines the generated trajectories. The edited scene graph is rendered and harmonized using a video diffusion tool, and then further refined by a Video Reviewer Agent to ensure…
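The iterative review described in the abstract, where a reviewer agent critiques generated trajectories and the result is refined, can be sketched as a generic propose-and-review loop. This is an illustration under stated assumptions rather than the paper's method: `refine_with_review`, the toy editor and reviewer, the speed-based trajectory format, and the round budget are all hypothetical stand-ins for the actual agents.

```python
from typing import Callable, List, Optional, Tuple

Trajectory = List[float]  # toy stand-in: one speed value (m/s) per frame


def refine_with_review(
    propose: Callable[[Optional[str]], Trajectory],
    review: Callable[[Trajectory], Optional[str]],
    max_rounds: int = 3,
) -> Tuple[Trajectory, int]:
    """Propose, review, and feed critiques back until the review passes
    or the round budget is used up. Returns the last proposal and the
    number of rounds run."""
    feedback: Optional[str] = None
    trajectory: Trajectory = []
    for round_idx in range(1, max_rounds + 1):
        trajectory = propose(feedback)
        feedback = review(trajectory)
        if feedback is None:  # reviewer has no objections
            return trajectory, round_idx
    return trajectory, max_rounds


def toy_editor(feedback: Optional[str]) -> Trajectory:
    # Hypothetical editor: the first proposal is too fast; it slows down after a critique.
    return [12.0, 14.0, 16.0] if feedback is None else [8.0, 9.0, 10.0]


def toy_reviewer(trajectory: Trajectory, limit: float = 10.0) -> Optional[str]:
    # Hypothetical reviewer: returns None on success, otherwise a critique string.
    return None if max(trajectory) <= limit else "exceeds speed limit"


final, rounds = refine_with_review(toy_editor, toy_reviewer)
```

Bounding the loop with `max_rounds` keeps the sketch from cycling forever when the reviewer is never satisfied; how the actual system terminates is not stated in the text above.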

Code & Models

Repositories

https://yunhe24.github.io/langdrivectrl
