K-Gen: A Multimodal Language-Conditioned Approach for Interpretable Keypoint-Guided Trajectory Generation
Mingxuan Mu, Guo Yang, Lei Chen, Ping Wu, Jianxun Cui

TL;DR
K-Gen is a multimodal, interpretable framework that combines visual and textual scene understanding with keypoint guidance to generate realistic trajectories for autonomous driving simulation, outperforming baselines on WOMD and nuPlan.
Contribution
The paper presents K-Gen, a multimodal approach that unifies rasterized BEV maps with textual scene descriptions to generate interpretable keypoints, which a refinement module turns into trajectories. Keypoint generation is further improved by T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm.
Findings
K-Gen outperforms baselines on WOMD and nuPlan datasets.
Keypoint-guided reasoning improves trajectory accuracy.
Reinforcement fine-tuning enhances keypoint generation quality.
Abstract
Generating realistic and diverse trajectories is a critical challenge in autonomous driving simulation. While Large Language Models (LLMs) show promise, existing methods often rely on structured data like vectorized maps, which fail to capture the rich, unstructured visual context of a scene. To address this, we propose K-Gen, an interpretable keypoint-guided multimodal framework that leverages Multimodal Large Language Models (MLLMs) to unify rasterized BEV map inputs with textual scene descriptions. Instead of directly predicting full trajectories, K-Gen generates interpretable keypoints along with reasoning that reflects agent intentions, which are subsequently refined into accurate trajectories by a refinement module. To further enhance keypoint generation, we apply T-DAPO, a trajectory-aware reinforcement fine-tuning algorithm. Experiments on WOMD and nuPlan demonstrate that K-Gen…
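To make the keypoint-then-refine idea concrete, here is a minimal sketch of how a handful of sparse keypoints could be expanded into a dense trajectory. It is not the paper's refinement module, which the abstract does not specify. Plain linear interpolation is swapped in as a stand-in, and the function name `densify_keypoints`, the parameter `steps_per_segment`, and the sample coordinates are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def densify_keypoints(keypoints: List[Point], steps_per_segment: int) -> List[Point]:
    """Expand sparse (x, y) keypoints into a dense path by linear interpolation.

    Assumption: this is an illustrative stand-in for a learned refinement
    module, not the method described in the paper.
    """
    if steps_per_segment < 1:
        raise ValueError("steps_per_segment must be >= 1")
    if len(keypoints) < 2:
        # Nothing to interpolate between; return a copy of the input.
        return list(keypoints)

    trajectory: List[Point] = [keypoints[0]]
    # Walk over consecutive keypoint pairs and fill in intermediate points.
    for (x0, y0), (x1, y1) in zip(keypoints, keypoints[1:]):
        for step in range(1, steps_per_segment + 1):
            t = step / steps_per_segment
            trajectory.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return trajectory


# Hypothetical keypoints in a BEV frame (metres): go straight, then drift left.
keypoints = [(0.0, 0.0), (10.0, 0.0), (20.0, 5.0)]
trajectory = densify_keypoints(keypoints, steps_per_segment=5)
```

Every keypoint is preserved exactly in the dense output, so the sparse points stay inspectable after densification. A learned refiner would presumably also adjust the path for kinematic feasibility and map constraints, which this sketch ignores.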
Taxonomy
Topics: Autonomous Vehicle Technology and Safety · Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis
