Translating Images into Maps
Avishkar Saha, Oscar Mendez Maldonado, Chris Russell, Richard Bowden

TL;DR
This paper introduces a transformer-based method for converting images into top-down maps in real time. It frames map generation as a sequence translation problem and achieves state-of-the-art results on large-scale datasets.
Contribution
The authors propose a constrained transformer network that models image-to-map translation as a sequence-to-sequence problem, leveraging physical assumptions for improved efficiency and accuracy.
Findings
Achieved 15% and 30% relative improvements over the previous best-performing methods on the nuScenes and Argoverse datasets, respectively.
Developed a sequence-based transformer architecture, convolutional in the horizontal direction only, for image-to-map translation.
Demonstrated state-of-the-art performance in instantaneous mapping tasks.
Abstract
We approach instantaneous mapping, converting images to a top-down view of the world, as a translation problem. We show how a novel form of transformer network can be used to map from images and video directly to an overhead map or bird's-eye-view (BEV) of the world, in a single end-to-end network. We assume a 1-1 correspondence between a vertical scanline in the image, and rays passing through the camera location in an overhead map. This lets us formulate map generation from an image as a set of sequence-to-sequence translations. Posing the problem as translation allows the network to use the context of the image when interpreting the role of each pixel. This constrained formulation, based upon a strong physical grounding of the problem, leads to a restricted transformer network that is convolutional in the horizontal direction only. The structure allows us to make efficient use of…
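The scanline-to-ray correspondence described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it is a minimal, assumption-laden example in plain Python where each image column (a sequence of pixel features) is translated into one polar ray (a sequence of depth-bin features) by a single cross-attention step. The names `translate_column`, `image_to_polar_bev`, and `ray_queries` are hypothetical, and the sketch omits learned projections, positional encodings, and everything else a trained model would need. Reusing the same queries for every column is what stands in for the "convolutional in the horizontal direction only" property.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def translate_column(column, ray_queries):
    """Translate one vertical scanline into one polar ray.

    column:      H feature vectors, the pixels of a single image column.
    ray_queries: R query vectors, one per depth bin along the ray
                 (hypothetical stand-ins for learned queries).
    Returns R feature vectors, one per depth bin.
    """
    scale = math.sqrt(len(ray_queries[0]))
    ray = []
    for q in ray_queries:
        # Each depth bin attends over every pixel in the column, so the
        # whole scanline provides context for each output position.
        weights = softmax([dot(q, px) / scale for px in column])
        ray.append([sum(w * px[c] for w, px in zip(weights, column))
                    for c in range(len(q))])
    return ray


def image_to_polar_bev(image_feats, ray_queries):
    """Apply the same column translation independently to every column.

    image_feats: W columns, each a list of H feature vectors.
    Returns W rays, each a list of R feature vectors.
    """
    return [translate_column(col, ray_queries) for col in image_feats]


# Toy usage with random features (all sizes are arbitrary assumptions).
random.seed(0)
H, W, C, R = 8, 6, 4, 10
image_feats = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(H)]
               for _ in range(W)]
ray_queries = [[random.gauss(0, 1) for _ in range(C)] for _ in range(R)]
polar_bev = image_to_polar_bev(image_feats, ray_queries)
print(len(polar_bev), len(polar_bev[0]), len(polar_bev[0][0]))  # 6 10 4
```

Because each column is translated on its own, changing the pixels in one scanline only affects the corresponding ray, which mirrors the 1-1 correspondence the abstract assumes.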
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Multimodal Machine Learning Applications
