MTA-RL: Robust Urban Driving via Multi-modal Transformer-based 3D Affordances and Reinforcement Learning

Guangli Chen, Dianzhao Li, Wenjian Zhong, Bangquan Xie, and Ostap Okhrin

arXiv:2605.10177 · cs.CV · May 12, 2026

TL;DR

MTA-RL is a framework that couples transformer-based multi-modal perception with reinforcement learning for robust urban autonomous driving, improving generalization, stability, and interpretability over prior methods.

Contribution

The paper presents the first framework combining multi-modal transformer-based 3D affordances with RL for urban driving, enhancing robustness and sample efficiency.

Findings

01. Outperforms state-of-the-art baselines in CARLA across various densities.

02. Demonstrates superior zero-shot generalization to unseen towns.

03. Ablation confirms importance of multi-modal fusion and reward shaping.

Abstract

Robust urban autonomous driving requires reliable 3D scene understanding and stable decision-making under dense interactions. However, existing end-to-end models lack interpretability, while modular pipelines suffer from error propagation across brittle interfaces. This paper proposes MTA-RL, the first framework that bridges perception and control through Multi-modal Transformer-based 3D Affordances and Reinforcement Learning (RL). Unlike previous fusion models that directly regress actions, MTA-RL fuses RGB images and LiDAR point clouds with a transformer architecture to predict explicit, geometry-aware affordance representations. These structured representations serve as a compact observation space, so the RL policy operates purely on predicted driving semantics, which significantly improves sample efficiency and stability. Extensive evaluations in CARLA Town01-03 across varying…
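The idea of affordances as a compact observation space can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the affordance fields, the normalization range, and the hand-written `toy_policy` are hypothetical and are not taken from the paper, which uses a transformer to predict the affordances and a learned RL policy to act on them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Affordances:
    """Hypothetical affordance set; field names are illustrative only."""
    lateral_offset_m: float   # signed distance from the lane centre
    heading_error_rad: float  # angle between ego heading and lane direction
    lead_distance_m: float    # distance to the nearest vehicle ahead
    red_light: bool           # whether a red light applies to the ego lane


def to_observation(a: Affordances, max_distance_m: float = 50.0) -> List[float]:
    """Flatten affordances into a small fixed-length vector for a policy."""
    # Clip the gap so the normalized value stays in [0, 1].
    clipped = min(max(a.lead_distance_m, 0.0), max_distance_m)
    return [
        a.lateral_offset_m,
        a.heading_error_rad,
        clipped / max_distance_m,
        1.0 if a.red_light else 0.0,
    ]


def toy_policy(obs: List[float]) -> Tuple[float, float]:
    """Hand-written stand-in for a learned policy: returns (steer, throttle)."""
    lateral, heading, gap, red = obs
    # Steer against the lateral and heading errors, clamped to [-1, 1].
    steer = max(-1.0, min(1.0, -(0.5 * lateral + 1.0 * heading)))
    # Stop at a red light; otherwise scale throttle with the free gap.
    throttle = 0.0 if red > 0.5 else gap
    return steer, throttle


obs = to_observation(Affordances(0.5, -0.1, 25.0, True))
steer, throttle = toy_policy(obs)
```

The point of the sketch is the interface: the policy sees four numbers with clear driving meaning rather than raw images or point clouds, which is what makes the observation space small and inspectable.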
