PixCLIP: Achieving Fine-grained Visual Language Understanding via Any-granularity Pixel-Text Alignment Learning

Yicheng Xiao; Yu Chen; Haoxuan Ma; Jiale Hong; Caorui Li; Lingxiang Wu; Haiyun Guo; Jinqiao Wang

arXiv:2511.04601 · cs.CV · November 7, 2025

Open Access

TL;DR

PixCLIP introduces a novel framework that enhances fine-grained visual-language understanding by integrating pixel-level alignment with long textual descriptions, leveraging a new dataset and a three-branch learning approach.

Contribution

The paper presents PixCLIP, a framework that combines visual prompts and long text processing for improved pixel-text alignment, along with a new dataset LongGRIT and a three-branch learning method.

Findings

01

Achieves state-of-the-art performance in fine-grained vision-language tasks.

02

Demonstrates superior pixel-level interaction and long-form text handling.

03

Outperforms existing models in detailed image-text alignment.

Abstract

While the Contrastive Language-Image Pretraining (CLIP) model has achieved remarkable success in a variety of downstream vision-language understanding tasks, enhancing its capability for fine-grained image-text alignment remains an active research focus. To this end, most existing works explicitly increase the granularity of visual information processing, e.g., by incorporating visual prompts that guide the model to focus on specific local regions within the image. Meanwhile, research on Multimodal Large Language Models (MLLMs) has demonstrated that training with long and detailed textual descriptions can effectively improve a model's fine-grained vision-language alignment. However, the inherent token length limitation of CLIP's text encoder fundamentally prevents CLIP from processing the more granular textual information embedded in long text sequences. To synergistically…
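
The token length limitation mentioned in the abstract can be illustrated with a minimal sketch. It is not PixCLIP's method and does not use CLIP's real tokenizer: `toy_tokenize` is a hypothetical whitespace tokenizer standing in for CLIP's BPE tokenizer, and `truncate_for_clip` is an illustrative helper. The only grounded detail is the 77-token context length of the original CLIP text encoder, two positions of which are assumed here to be taken by the start and end-of-text tokens.

```python
CONTEXT_LENGTH = 77  # context length of the original CLIP text encoder


def toy_tokenize(text):
    """Hypothetical stand-in for CLIP's BPE tokenizer: one token per word."""
    return text.split()


def truncate_for_clip(text, context_length=CONTEXT_LENGTH):
    """Split tokens into those that fit the context window and those dropped.

    Two positions are reserved for the start and end-of-text tokens
    (an assumption of this sketch), so only context_length - 2 content
    tokens survive.
    """
    tokens = toy_tokenize(text)
    budget = context_length - 2
    return tokens[:budget], tokens[budget:]


# A short caption fits entirely; a long, detailed description does not.
short_caption = "a dog on a beach"
long_caption = " ".join(f"detail{i}" for i in range(200))

kept_short, dropped_short = truncate_for_clip(short_caption)
kept_long, dropped_long = truncate_for_clip(long_caption)

print(len(kept_short), len(dropped_short))
print(len(kept_long), len(dropped_long))
```

Under these assumptions, everything in the long description past the content budget never reaches the text encoder, which is the kind of information loss the abstract attributes to the token length limit. A real BPE tokenizer usually produces more tokens per word than this toy version, so the effective budget in words would be smaller still.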

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning