Search-TTA: A Multimodal Test-Time Adaptation Framework for Visual Search in the Wild
Derek Ming Siang Tan, Shailesh, Boyang Liu, Alok Raj, Qi Xuan Ang, Weiheng Dai, Tanishq Duhan, Jimmy Chiun, Yuhong Cao, Florian Shkurti, and Guillaume Sartoretti

TL;DR
Search-TTA is a versatile multimodal test-time adaptation framework that enhances visual search performance in outdoor navigation tasks by refining model predictions during search, especially under domain mismatch and limited data conditions.
Contribution
Search-TTA introduces a test-time adaptation method that refines multimodal visual search predictions during the search itself using uncertainty-weighted updates, and contributes a new dataset for evaluation.
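To make the idea of an uncertainty-weighted update concrete, here is a toy sketch in Python. It is not the paper's method: the function name `refine_prior`, the 1-D list of cell scores, the `step` size, and the rule of moving a score toward the observation in proportion to a per-cell weight are all assumptions made purely for illustration. It only shows the general pattern of correcting a prior with observations gathered during search, where each observation counts for as much as its weight allows.

```python
def refine_prior(prior, observations, uncertainty, step=0.5):
    """Toy uncertainty-weighted refinement of a 1-D search prior.

    prior:        list of non-negative scores, one per map cell.
    observations: dict mapping cell index -> 1 (target seen) or 0 (not seen).
    uncertainty:  dict mapping cell index -> weight in [0, 1]; larger weights
                  let the observation move the score further (hypothetical rule).
    """
    updated = list(prior)
    for cell, seen in observations.items():
        weight = uncertainty.get(cell, 1.0)
        # Move the score toward the observed outcome, scaled by the weight.
        updated[cell] += step * weight * (seen - updated[cell])

    total = sum(updated)
    if total == 0:
        # Degenerate case: fall back to a uniform map.
        return [1.0 / len(updated)] * len(updated)
    # Renormalize so the scores again sum to 1.
    return [score / total for score in updated]


prior = [0.4, 0.3, 0.2, 0.1]
# The searcher visits cell 0 and finds nothing there.
refined = refine_prior(prior, observations={0: 0}, uncertainty={0: 1.0})
print(refined)
```

In this toy setting, the visited cell loses score and the unvisited cells gain relative weight after renormalization, so a planner reading the refined map would be steered away from the area already searched.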
Findings
Improves planner performance by up to 30%.
Performs comparably with larger vision models.
Achieves zero-shot generalization to unseen modalities.
Abstract
To perform outdoor visual navigation and search, a robot may leverage satellite imagery to generate visual priors. This can help inform high-level search strategies, even when such images lack sufficient resolution for target recognition. However, many existing informative path planning or search-based approaches either assume no prior information, or use priors without accounting for how they were obtained. Recent work instead utilizes large Vision Language Models (VLMs) for generalizable priors, but their outputs can be inaccurate due to hallucination, leading to inefficient search. To address these challenges, we introduce Search-TTA, a multimodal test-time adaptation framework with a flexible plug-and-play interface compatible with various input modalities (e.g., image, text, sound) and planning methods (e.g., RL-based). First, we pretrain a satellite image encoder to align with…
