TL;DR
DetailCLIP enhances CLIP's ability to retain high-resolution image details across scales by fusing features from multiple patches, significantly improving image retrieval in scenarios with tiny objects.
Contribution
The paper introduces a feature fusion framework that preserves multi-scale image details in CLIP's feature space, addressing resolution limitations in high-resolution images.
Findings
Significant improvement in class-prompted image retrieval accuracy.
Effective preservation of multi-scale image details in feature representations.
Demonstrated capability in detail retrieval using a synthetic dataset.
Abstract
Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, due to the limitation of the CLIP-like model's image input size (e.g., 224), subtle details are lost in the feature representation if we input high-resolution images (e.g., 2240). Our proposed framework addresses this issue by generating a single feature representation for a high-resolution image that retains image details from different scales while sharing the same semantic space as the original CLIP. An application scenario is remote sensing text-image retrieval, where targets (e.g., vehicles and ships) often appear at tiny scales. To achieve this, we develop a feature fusion model that relies on CLIP features extracted from a carefully designed image patch method, dubbed Complete Cover. This method ensures comprehensive coverage of objects across various scales and is weakly…
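To make the multi-scale patching idea concrete, here is a minimal sketch. It does not reproduce the paper's Complete Cover scheme or its learned fusion model, neither of which is specified in this summary. It assumes a plain regular grid at a few hypothetical grid sizes and uses simple mean pooling as a stand-in for fusion. The names grid_patch_boxes and mean_fuse are illustrative, and the per-patch feature vectors would come from a CLIP image encoder that is not shown.

```python
import numpy as np

def grid_patch_boxes(height, width, grid_sizes=(1, 2, 4)):
    """Return (top, left, bottom, right) boxes tiling the image at each grid size.

    Assumption: a regular g x g grid per scale, not the paper's Complete Cover.
    """
    boxes = []
    for g in grid_sizes:
        ys = np.linspace(0, height, g + 1).astype(int)
        xs = np.linspace(0, width, g + 1).astype(int)
        for i in range(g):
            for j in range(g):
                boxes.append((int(ys[i]), int(xs[j]), int(ys[i + 1]), int(xs[j + 1])))
    return boxes

def mean_fuse(patch_features):
    """Naive stand-in for fusion: L2-normalize each patch feature, average, renormalize."""
    feats = np.asarray(patch_features, dtype=np.float64)
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    fused = feats.mean(axis=0)
    return fused / np.linalg.norm(fused)

# A 2240 x 2240 image split into 1 + 4 + 16 patches; each patch would be
# resized to the encoder's input size before feature extraction.
boxes = grid_patch_boxes(2240, 2240)
```

With these grid sizes the finest patches are 560 x 560 pixels, so a tiny object occupies a much larger share of the resized encoder input than it would in the full image.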
