TL;DR
LAMP uses large language models to translate natural language descriptions into explicit 3D motion trajectories for objects and cameras, enhancing controllability in video generation.
Contribution
LAMP introduces a framework that uses LLMs and a motion domain-specific language (DSL) to generate structured motion programs from natural language; these programs are deterministically mapped to 3D trajectories for video synthesis.
Findings
LAMP outperforms existing methods in motion controllability.
The framework effectively aligns generated motions with user intent.
A large-scale procedural dataset of paired text descriptions and motion supports training and evaluation.
Abstract
Video generation has achieved remarkable progress in visual fidelity and controllability, enabling conditioning on text, layout, or motion. Among these, motion control, which specifies object dynamics and camera trajectories, is essential for composing complex, cinematic scenes, yet existing interfaces remain limited. We introduce LAMP, which leverages large language models (LLMs) as motion planners to translate natural language descriptions into explicit 3D trajectories for dynamic objects and (relatively defined) cameras. LAMP defines a motion domain-specific language (DSL) inspired by cinematography conventions. By harnessing the program synthesis capabilities of LLMs, LAMP generates structured motion programs from natural language, which are deterministically mapped to 3D trajectories. We construct a large-scale procedural dataset pairing natural text descriptions with corresponding motion…
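To make the idea of a motion program that is "deterministically mapped to 3D trajectories" concrete, here is a minimal sketch in Python. It is not LAMP's DSL: the primitives `Move` and `Hold`, the interpreter `run_program`, and the helper `follow_camera` are hypothetical names invented for illustration, and linear interpolation is an assumed choice. The sketch only shows the general pattern of a structured program (which an LLM could emit) being expanded by a fixed interpreter into per-frame object positions, with a camera defined relative to the object.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Move:
    """Hypothetical primitive: translate by `delta` over `frames` frames."""
    delta: Vec3
    frames: int


@dataclass(frozen=True)
class Hold:
    """Hypothetical primitive: stay in place for `frames` frames."""
    frames: int


Step = Union[Move, Hold]


def run_program(program: List[Step], start: Vec3 = (0.0, 0.0, 0.0)) -> List[Vec3]:
    """Expand a motion program into one 3D position per frame.

    The mapping has no randomness, so the same program always
    yields the same trajectory.
    """
    pos = start
    trajectory = [pos]
    for step in program:
        if isinstance(step, Move):
            origin = pos
            for i in range(1, step.frames + 1):
                t = i / step.frames  # linear interpolation (an assumption)
                pos = tuple(o + t * d for o, d in zip(origin, step.delta))
                trajectory.append(pos)
        elif isinstance(step, Hold):
            trajectory.extend([pos] * step.frames)
        else:
            raise TypeError(f"unknown step: {step!r}")
    return trajectory


def follow_camera(object_traj: List[Vec3], offset: Vec3) -> List[Vec3]:
    """Camera defined relative to the object: a fixed offset from it."""
    return [tuple(p + o for p, o in zip(pos, offset)) for pos in object_traj]


# A toy program: move right, pause, then move forward.
program = [Move((2.0, 0.0, 0.0), 4), Hold(2), Move((0.0, 0.0, 1.0), 2)]
object_traj = run_program(program)
camera_traj = follow_camera(object_traj, offset=(0.0, 1.5, -3.0))

for frame, (obj, cam) in enumerate(zip(object_traj, camera_traj)):
    print(frame, obj, cam)
```

Keeping the interpreter separate from the program is the point of the design: the language model only has to produce a short, checkable program, while the conversion to coordinates stays fixed and reproducible.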
