TL;DR
Ramen is a framework that makes vision-language models more robust at test time by actively selecting relevant samples for adaptation; it is especially effective under mixed-domain shifts.
Contribution
It introduces an active sample selection method with an embedding-gradient cache for efficient, robust test-time adaptation in mixed-domain scenarios.
Findings
Ramen outperforms existing methods on multiple benchmarks.
It maintains strong performance under mixed-domain test data.
The embedding-gradient cache improves adaptation efficiency.
Abstract
Pretrained vision-language models such as CLIP exhibit strong zero-shot generalization but remain sensitive to distribution shifts. Test-time adaptation adapts models during inference without access to source data or target labels, offering a practical way to handle such shifts. However, existing methods typically assume that test samples come from a single, consistent domain, while in practice, test data often include samples from mixed domains with distinct characteristics. Consequently, their performance degrades under mixed-domain settings. To address this, we present Ramen, a framework for robust test-time adaptation through active sample selection. For each incoming test sample, Ramen retrieves a customized batch of relevant samples from previously seen data based on two criteria: domain consistency, which ensures that adaptation focuses on data from similar domains, and…
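The retrieval step described in the abstract can be illustrated with a minimal sketch. It covers only the domain-consistency criterion; the second criterion is cut off in the abstract and is not modeled here. Everything in the block is an assumption made for illustration, not the authors' implementation: the `EmbeddingCache` class, the use of cosine similarity between embeddings as a stand-in for domain consistency, the FIFO eviction policy, and the toy 2-D embeddings are all hypothetical.

```python
import math
from collections import deque


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


class EmbeddingCache:
    """Hypothetical cache of previously seen test samples, keyed by embedding."""

    def __init__(self, capacity=1000):
        # Assumption: oldest entries are evicted first once capacity is reached.
        self.items = deque(maxlen=capacity)

    def add(self, sample_id, embedding):
        self.items.append((sample_id, list(embedding)))

    def retrieve(self, query, k=4):
        # Assumption: embedding similarity is used as a proxy for "same domain".
        scored = [(cosine(query, emb), sid) for sid, emb in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [sid for _, sid in scored[:k]]


# Toy stream mixing two "domains" that point in different embedding directions.
cache = EmbeddingCache(capacity=100)
cache.add("photo_1", [1.0, 0.1])
cache.add("photo_2", [0.9, 0.2])
cache.add("sketch_1", [0.1, 1.0])
cache.add("sketch_2", [0.0, 0.9])

# A new photo-like sample retrieves a batch drawn from the photo-like entries.
batch = cache.retrieve([0.95, 0.15], k=2)
print(batch)
```

In this toy setup the customized batch for a photo-like query contains only photo-like samples, which is the behavior the abstract attributes to the domain-consistency criterion: adaptation for a given sample is driven by previously seen data from a similar domain rather than by whatever happens to arrive in the same mini-batch.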
