SEAR: Simple and Efficient Adaptation of Visual Geometric Transformers for RGB+Thermal 3D Reconstruction

Vsevolod Skorokhodov; Chenghao Xu; Shuo Sun; Olga Fink; Malcolm Mielle

arXiv:2603.18774 · cs.CV · March 20, 2026

TL;DR

SEAR is a fine-tuning strategy that adapts pretrained visual geometry transformers for effective RGB-thermal 3D reconstruction and pose estimation, outperforming state-of-the-art methods with minimal inference overhead.

Contribution

The paper introduces SEAR, a simple and efficient fine-tuning approach that enables pretrained geometry transformers to handle multimodal RGB-T inputs for 3D reconstruction.

Findings

01. Significant performance improvements over state-of-the-art methods (e.g., over 29% in AUC@30).

02. Effective multimodal pose estimation under challenging conditions like low light and smoke.

03. Introduction of a new RGB-thermal dataset for benchmarking multimodal 3D reconstruction.
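
The listing does not say how AUC@30 is computed. The following is a minimal sketch, assuming the convention common in camera pose benchmarks: each image pair gets a rotation error and a translation direction error in degrees, the larger of the two is compared against thresholds from 1 to 30 degrees, and the resulting accuracies are averaged. The function name `pose_auc` and the sample errors are hypothetical, and the paper's own evaluation protocol may differ in its details.

```python
import numpy as np


def pose_auc(rot_err_deg, trans_err_deg, max_threshold=30):
    """Mean pose accuracy over integer thresholds 1..max_threshold (degrees).

    Assumed convention, not taken from the paper: a pair counts as correct
    at threshold t only if both its rotation error and its translation
    direction error are below t.
    """
    rot = np.asarray(rot_err_deg, dtype=float)
    trans = np.asarray(trans_err_deg, dtype=float)
    # Per-pair error is the worse of the two angular errors.
    err = np.maximum(rot, trans)
    thresholds = np.arange(1, max_threshold + 1)
    # Fraction of pairs below each threshold, then averaged over thresholds.
    accuracy = np.array([(err < t).mean() for t in thresholds])
    return float(accuracy.mean())


# Hypothetical errors for three image pairs (degrees).
score = pose_auc([0.5, 10.5, 40.0], [0.2, 5.0, 1.0])
print(f"AUC@30 = {score:.4f}")
```

Under this convention a pair with a 40 degree error never contributes, so a single badly misaligned RGB-thermal pair lowers the score at every threshold.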

Abstract

Foundational feed-forward visual geometry models enable accurate and efficient camera pose estimation and scene reconstruction by learning strong scene priors from massive RGB datasets. However, their effectiveness drops when applied to mixed sensing modalities, such as RGB-thermal (RGB-T) images. We observe that while a visual geometry grounded transformer pretrained on RGB data generalizes well to thermal-only reconstruction, it struggles to align RGB and thermal modalities when processed jointly. To address this, we propose SEAR, a simple yet efficient fine-tuning strategy that adapts a pretrained geometry transformer to multimodal RGB-T inputs. Despite being trained on a relatively small RGB-T dataset, our approach significantly outperforms state-of-the-art methods for 3D reconstruction and camera pose estimation, with improvements across all metrics (e.g., over 29%…

Taxonomy

Topics: Robotics and Sensor-Based Localization · Robot Manipulation and Learning · Advanced Vision and Imaging