MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning
Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, and Ostap Okhrin

TL;DR
MTA-RL introduces a transformer-based multi-modal perception and reinforcement learning framework for robust urban autonomous driving, improving generalization, stability, and interpretability over prior methods.
Contribution
The paper presents the first framework combining multi-modal transformer-based 3D affordances with RL for urban driving, enhancing robustness and sample efficiency.
Findings
Outperforms state-of-the-art baselines in CARLA across varying traffic densities.
Demonstrates superior zero-shot generalization to unseen towns.
Ablation confirms importance of multi-modal fusion and reward shaping.
Abstract
Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, MTA-RL fuses RGB images and LiDAR point clouds with a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, enabling the RL policy to operate purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying…
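To make the idea of affordances as a compact observation space concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Affordances` fields, the normalization ranges in `_SCALES`, and `placeholder_policy` are all hypothetical names and values chosen for illustration, and the perception model that would produce the affordances from RGB and LiDAR is left out entirely.

```python
import math
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class Affordances:
    """Hypothetical affordance set; the paper's actual outputs may differ."""
    lateral_offset_m: float   # signed distance from the lane centre
    heading_error_rad: float  # angle between ego heading and lane direction
    lead_distance_m: float    # gap to the nearest vehicle ahead
    speed_mps: float          # ego speed


# Assumed normalization ranges, one per field, for illustration only.
_SCALES = (2.0, math.pi, 50.0, 15.0)


def _clip(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def to_observation(a: Affordances) -> list[float]:
    """Flatten affordances into a fixed-length vector clipped to [-1, 1]."""
    return [_clip(v / s, -1.0, 1.0) for v, s in zip(astuple(a), _SCALES)]


def placeholder_policy(obs: list[float]) -> tuple[float, float]:
    """Stand-in for a learned RL policy: maps the observation to (steer, throttle)."""
    lateral, heading, gap, speed = obs
    steer = _clip(-(lateral + heading), -1.0, 1.0)
    throttle = _clip(gap - speed, 0.0, 1.0)
    return steer, throttle


obs = to_observation(Affordances(1.0, 0.0, 100.0, 7.5))
action = placeholder_policy(obs)
```

The point of the sketch is the interface: the policy sees only a handful of named, bounded quantities rather than raw pixels or points, which is what lets each input be inspected and interpreted.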
