Towards Efficient and Effective Multi-Camera Encoding for End-to-End Driving

Jiawei Yang; Ziyu Chen; Yurong You; Yan Wang; Yiming Li; Yuxiao Chen; Boyi Li; Boris Ivanovic; Marco Pavone; Yue Wang

arXiv:2512.10947·cs.CV·December 16, 2025

TL;DR

Flex introduces a compact, learnable scene encoding method for multi-camera autonomous driving that improves inference speed and driving performance without relying on explicit 3D priors.

Contribution

The paper proposes a geometry-agnostic scene encoder using learnable tokens, enabling efficient multi-camera data processing without explicit 3D representations.

Findings

01 2.2x higher inference throughput

02 Improved driving performance over state-of-the-art

03 Emergent scene decomposition capabilities

Abstract

We present Flex, an efficient and effective scene encoder that addresses the computational bottleneck of processing high-volume multi-camera data in end-to-end autonomous driving. Flex employs a small set of learnable scene tokens to jointly encode information from all image tokens across different cameras and timesteps. By design, our approach is geometry-agnostic: it learns a compact scene representation directly from data without relying on explicit 3D inductive biases, such as Bird's-Eye-View (BEV), occupancy, or tri-plane representations, which are common in prior work. This holistic encoding strategy aggressively compresses the visual input for the downstream Large Language Model (LLM)-based policy model. Evaluated on a large-scale proprietary dataset of 20,000 driving hours, Flex achieves 2.2x greater inference throughput while improving driving performance by a large margin…
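The idea of a small set of learnable scene tokens attending jointly over all image tokens can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify the architecture, so the sketch assumes a single-head cross-attention layer in the style of latent-query encoders, written in plain Python. All names (`encode_scene`, `scene_queries`, `w_k`, `w_v`) and all sizes are hypothetical, and random values stand in for learned parameters.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def random_matrix(rows, cols, rng, scale=0.1):
    """Stand-in for learned parameters or image features (assumption)."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def encode_scene(image_tokens, scene_queries, w_k, w_v):
    """Single-head cross-attention: every scene query attends over every
    image token, regardless of which camera or timestep it came from."""
    dim = len(scene_queries[0])
    keys = matmul(image_tokens, w_k)
    values = matmul(image_tokens, w_v)
    scores = matmul(scene_queries, transpose(keys))
    weights = [softmax([s / math.sqrt(dim) for s in row]) for row in scores]
    return matmul(weights, values), weights


rng = random.Random(0)

# Hypothetical sizes, chosen only to keep the example small.
num_cameras, num_timesteps, patches_per_image = 3, 2, 4
dim, num_scene_tokens = 8, 2

# Flatten all cameras and timesteps into one token sequence; no camera
# geometry or 3D grid is used anywhere in this sketch.
num_image_tokens = num_cameras * num_timesteps * patches_per_image
image_tokens = random_matrix(num_image_tokens, dim, rng, scale=1.0)

scene_queries = random_matrix(num_scene_tokens, dim, rng)
w_k = random_matrix(dim, dim, rng)
w_v = random_matrix(dim, dim, rng)

scene_tokens, attention = encode_scene(image_tokens, scene_queries, w_k, w_v)

# Ratio of visual tokens before and after encoding.
compression_ratio = len(image_tokens) / len(scene_tokens)
```

In this toy setup, 24 image tokens are reduced to 2 scene tokens before anything would be passed to a downstream policy model. The ratio here is an artifact of the chosen sizes and says nothing about the compression Flex actually uses.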

Taxonomy

Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications