DetailCLIP: Injecting Image Details into CLIP's Feature Space

Zilun Zhang; Cuifeng Shen; Yuan Shen; Xinyu Zhou; Huixin Xiong; Tiancheng Zhao; Jianwei Yin

arXiv:2208.14649·cs.CV·April 23, 2026

TL;DR

DetailCLIP enhances CLIP's ability to retain high-resolution image details across scales by fusing features from multiple patches, significantly improving image retrieval in scenarios with tiny objects.

Contribution

The paper introduces a feature fusion framework that preserves multi-scale image details in CLIP's feature space, addressing resolution limitations in high-resolution images.

Findings

01 Significant improvement in class-prompted image retrieval accuracy.

02 Effective preservation of multi-scale image details in feature representations.

03 Demonstrated capability in detail retrieval using a synthetic dataset.

Abstract

Although CLIP-like Visual Language Models provide a functional joint feature space for image and text, the limited image input size of CLIP-like models (e.g., 224) means that subtle details are lost in the feature representation when high-resolution images (e.g., 2240) are used as input. Our proposed framework addresses this issue by generating a single feature representation for a high-resolution image that retains image details from different scales while sharing the same semantic space as the original CLIP. An application scenario is remote sensing text-image retrieval, where targets (e.g., vehicles and ships) often appear at tiny scales. To achieve this, we develop a feature fusion model that relies on CLIP features extracted from a carefully designed image patch method, dubbed Complete Cover. This method ensures comprehensive coverage of objects across various scales and is weakly…

Code & Models

Repositories

zilunzhang/DetailCLIP (GitHub)
