TL;DR
MuRF is a training-free method that enhances vision foundation models by fusing multi-resolution features at inference time, improving performance across various tasks without architectural changes.
Contribution
Proposes MuRF, a universal, inference-time multi-resolution fusion strategy that boosts vision models' capabilities across diverse tasks without retraining.
Findings
MuRF improves performance on multiple vision tasks.
It generalizes across different VFM architectures.
MuRF is simple, training-free, and universally applicable.
Abstract
Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale. This prevalent single-scale paradigm overlooks a fundamental property of visual perception: varying resolutions offer complementary inductive biases, where low-resolution views excel at global semantic recognition and high-resolution views are essential for fine-grained refinement. In this work, we propose Multi-Resolution Fusion (MuRF), a simple yet universally effective strategy to harness this synergy at inference time. Instead of relying on a single view, MuRF constructs a unified representation by processing an image at multiple resolutions through a frozen VFM and fusing the resulting…
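The pipeline the abstract describes (resize the image to several resolutions, run each through the same frozen model, fuse the outputs) can be illustrated with a minimal pure-Python sketch. The abstract is cut off before it states the fusion operator, so the step used here (nearest-neighbour resampling of each feature grid to a common size, then channel-wise concatenation) is an assumption for illustration only, not the authors' method. `toy_frozen_encoder`, `resize_nearest`, and `multi_resolution_fuse` are hypothetical names, and the encoder is a stand-in for a real VFM.

```python
from typing import Callable, List, Sequence

Grid = List[List[List[float]]]  # nested lists laid out as height x width x channels


def resize_nearest(grid: Grid, out_h: int, out_w: int) -> Grid:
    """Nearest-neighbour resampling of an H x W x C grid to out_h x out_w."""
    in_h, in_w = len(grid), len(grid[0])
    return [
        [list(grid[(y * in_h) // out_h][(x * in_w) // out_w]) for x in range(out_w)]
        for y in range(out_h)
    ]


def toy_frozen_encoder(image: Grid, patch: int = 2) -> Grid:
    """Hypothetical stand-in for a frozen VFM.

    Emits one 2-channel feature (mean, max of channel 0) per patch x patch block.
    """
    grid_h, grid_w = len(image) // patch, len(image[0]) // patch
    feats: Grid = []
    for py in range(grid_h):
        row = []
        for px in range(grid_w):
            vals = [
                image[py * patch + dy][px * patch + dx][0]
                for dy in range(patch)
                for dx in range(patch)
            ]
            row.append([sum(vals) / len(vals), max(vals)])
        feats.append(row)
    return feats


def multi_resolution_fuse(
    image: Grid,
    encoder: Callable[[Grid], Grid],
    sizes: Sequence[int],
    out_size: int,
) -> Grid:
    """Encode the image at each size, then fuse on a shared out_size x out_size grid.

    ASSUMPTION: fusion is channel-wise concatenation; the source does not say.
    """
    fused: Grid = [[[] for _ in range(out_size)] for _ in range(out_size)]
    for size in sizes:
        feats = encoder(resize_nearest(image, size, size))  # same frozen encoder each time
        feats = resize_nearest(feats, out_size, out_size)   # align spatial grids
        for y in range(out_size):
            for x in range(out_size):
                fused[y][x].extend(feats[y][x])
    return fused


# 8x8 single-channel test image whose pixel value is its raster index.
image = [[[float(y * 8 + x)] for x in range(8)] for y in range(8)]
fused = multi_resolution_fuse(image, toy_frozen_encoder, sizes=(4, 8), out_size=4)
print(len(fused), len(fused[0]), len(fused[0][0]))  # → 4 4 4
```

No weights are updated anywhere in the sketch: the encoder is only called, never trained, which is the sense in which such a scheme is training-free. A real pipeline would swap the toy encoder for an actual frozen model and would likely use a tensor library with bilinear interpolation.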
