VISTA: Monocular Segmentation-Based Mapping for Appearance and View-Invariant Global Localization

Hannah Shafferman; Annika Thomas; Jouko Kinnari; Michael Ricard; Jose Nino; Jonathan How

arXiv:2507.11653 · cs.CV · July 17, 2025

TL;DR

VISTA is a training-free, monocular, segmentation-based global localization framework that aligns vehicle reference frames across different environments and seasons, reporting up to 69% higher recall than baseline methods with a map only 0.6% the size of baseline maps.

Contribution

VISTA introduces a novel, domain-agnostic approach that combines object segmentation, tracking, and geometric matching to localize despite changes in appearance and viewpoint.
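To make the geometric matching idea concrete, here is a minimal sketch, not the authors' implementation. It assumes each map is reduced to a set of 3D object centroids, scores candidate object pairings by whether they preserve inter-object distances, and grows a mutually consistent set greedily (a simple stand-in for whatever correspondence search the paper actually uses). It then estimates the rigid transform between the two maps with the Kabsch algorithm. The function names, the tolerance `tol`, and the toy data are all hypothetical.

```python
import numpy as np

def consistent_matches(map_a, map_b, tol=0.1):
    """Greedily pick object pairs (i, j) that preserve pairwise distances.

    Assumption: maps are (N, 3) arrays of object centroids. This greedy pass is
    an illustrative simplification, not the paper's correspondence search.
    """
    pairs = [(i, j) for i in range(len(map_a)) for j in range(len(map_b))]
    a_idx = np.array([i for i, _ in pairs])
    b_idx = np.array([j for _, j in pairs])
    pa, pb = map_a[a_idx], map_b[b_idx]
    # Distance between the two objects of every pair-of-pairs, in each map.
    d_a = np.linalg.norm(pa[:, None] - pa[None, :], axis=-1)
    d_b = np.linalg.norm(pb[:, None] - pb[None, :], axis=-1)
    ok = np.abs(d_a - d_b) < tol
    # Enforce one-to-one matching: two pairs may not share an object.
    ok &= a_idx[:, None] != a_idx[None, :]
    ok &= b_idx[:, None] != b_idx[None, :]
    degree = ok.sum(axis=1)
    chosen = []
    # Visit the best-supported pairs first; keep those consistent with all kept so far.
    for p in np.argsort(-degree, kind="stable"):
        if all(ok[p, q] for q in chosen):
            chosen.append(p)
    return sorted(pairs[p] for p in chosen)

def rigid_align(src, dst):
    """Kabsch: least-squares R, t such that dst ≈ src @ R.T + t."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, mu_d - R @ mu_s

# Toy data (hypothetical): five objects seen in map A ...
map_a = np.array([[0, 0, 0], [1, 0, 0], [0, 3, 0], [5, 1, 0], [2, 7, 1]], dtype=float)
# ... and the same objects in map B, rotated 90 degrees about z, shifted,
# listed in a different order, plus one far-away spurious object.
R_true = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
t_true = np.array([10.0, -2.0, 0.5])
map_b = np.vstack([map_a[[2, 0, 4, 1, 3]] @ R_true.T + t_true, [[100.0, 100.0, 100.0]]])

matches = consistent_matches(map_a, map_b)
src = map_a[[i for i, _ in matches]]
dst = map_b[[j for _, j in matches]]
R_est, t_est = rigid_align(src, dst)
print(matches)
print(np.round(R_est, 3), np.round(t_est, 3))
```

Because the check uses only distances between objects, it does not depend on how the objects look or on the direction from which they were observed, which is the intuition behind appearance and view invariance. The all-pairs candidate set grows quadratically with map size, so a practical system would prune candidates first.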

Findings

01. Up to 69% improvement in recall over baseline methods.

02. Maintains a compact map only 0.6% the size of baseline maps.

03. Capable of real-time operation on resource-constrained platforms.

Abstract

Global localization is critical for autonomous navigation, particularly in scenarios where an agent must localize within a map generated in a different session or by another agent, as agents often have no prior knowledge about the correlation between reference frames. However, this task remains challenging in unstructured environments due to appearance changes induced by viewpoint variation, seasonal changes, spatial aliasing, and occlusions -- known failure modes for traditional place recognition methods. To address these challenges, we propose VISTA (View-Invariant Segmentation-Based Tracking for Frame Alignment), a novel open-set, monocular global localization framework that combines: 1) a front-end, object-based, segmentation and tracking pipeline, followed by 2) a submap correspondence search, which exploits geometric consistencies between environment maps to align vehicle…

Taxonomy

Topics: Face recognition and analysis · Advanced Image and Video Retrieval Techniques · Robotics and Sensor-Based Localization