Efficient Multi-Camera Tokenization with Triplanes for End-to-End Driving

Boris Ivanovic; Cristiano Saltori; Yurong You; Yan Wang; Wenjie Luo; Marco Pavone

arXiv:2506.12251 · cs.CV · July 22, 2025

TL;DR

This paper introduces a triplane-based multi-camera tokenization method for autonomous vehicles that reduces token count and inference time, enabling efficient end-to-end transformer-based driving policies without sacrificing accuracy.

Contribution

The paper proposes a novel 3D neural reconstruction-based tokenization strategy that is camera-agnostic and geometry-aware, improving efficiency over traditional image patch methods.

Findings

01. Up to 72% fewer tokens generated
02. 50% faster policy inference
03. Maintains same motion planning accuracy

Abstract

Autoregressive Transformers are increasingly being deployed as end-to-end robot and autonomous vehicle (AV) policy architectures, owing to their scalability and potential to leverage internet-scale pretraining for generalization. Accordingly, tokenizing sensor data efficiently is paramount to ensuring the real-time feasibility of such architectures on embedded hardware. To this end, we present an efficient triplane-based multi-camera tokenization strategy that leverages recent advances in 3D neural reconstruction and rendering to produce sensor tokens that are agnostic to the number of input cameras and their resolution, while explicitly accounting for their geometry around an AV. Experiments on a large-scale AV dataset and state-of-the-art neural simulator demonstrate that our approach yields significant savings over current image patch-based tokenization strategies, producing up to…
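To make the "agnostic to the number of input cameras and their resolution" claim concrete, the following minimal sketch compares how token counts scale under the two strategies. It is an illustration under stated assumptions, not the paper's implementation: the `Camera` class, both function names, the image sizes, the plane dimensions, and the patch sizes are all hypothetical, and the sketch assumes the three feature planes are themselves split into square patches to form tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Camera:
    """Hypothetical camera description: image size in pixels."""
    width: int
    height: int


def patch_token_count(cameras, patch_size):
    """Image-patch tokenization: one token per patch, per camera.

    The total grows with both the number of cameras and their resolution.
    """
    return sum(
        (cam.width // patch_size) * (cam.height // patch_size)
        for cam in cameras
    )


def triplane_token_count(sx, sy, sz, plane_patch):
    """Triplane tokenization (assumed form): patchify the xy, xz and yz planes.

    The total depends only on the plane grid sizes (sx, sy, sz) and the
    plane patch size, not on how many cameras produced the features.
    """
    nx, ny, nz = sx // plane_patch, sy // plane_patch, sz // plane_patch
    return nx * ny + nx * nz + ny * nz


if __name__ == "__main__":
    # Illustrative numbers only; none of these come from the paper.
    for n_cams in (3, 7):
        cams = [Camera(width=640, height=480)] * n_cams
        print(
            n_cams,
            "cameras:",
            patch_token_count(cams, patch_size=16),
            "patch tokens vs",
            triplane_token_count(sx=96, sy=96, sz=48, plane_patch=8),
            "triplane tokens",
        )
```

In this sketch the patch-based count rises linearly as cameras are added, while the triplane count stays fixed, which is the scaling behavior the abstract describes. The actual savings depend on the real camera rig and triplane configuration.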

Taxonomy

Topics: Advanced Vision and Imaging · Image and Video Stabilization · Video Coding and Compression Technologies