Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

Yuhao Sun; Chengyi Cai; Jiacheng Zhang; Zesheng Ye; Xingliang Yuan; Feng Liu

arXiv:2601.20419·cs.CV·January 29, 2026

Open Access

TL;DR

BiFTA is a bi-refinement approach that improves fine-grained text-visual alignment by removing redundant image patches and redundant text descriptions, which raises the zero-shot performance of pre-trained vision-language models such as CLIP.

Contribution

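The paper proposes a bi-refinement method with two parts: view refinement, which filters out image patches that have high Intersection over Union (IoU) ratios, and description refinement, which filters out text descriptions that have high pairwise cosine similarity. Both aim at better zero-shot results.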

Findings

01 Achieves superior zero-shot performance on 6 benchmark datasets.

02 Effectively removes redundant information from images and text.

03 Improves alignment quality in vision-language models.

Abstract

Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: view refinement and description refinement, termed Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the…
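The two refinement steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how redundant items are selected, so the greedy keep-first rule, the threshold values, the `(x1, y1, x2, y2)` box format, and all function names (`greedy_filter`, `refine_views`, `refine_descriptions`) are assumptions made for illustration. In practice the description vectors would come from a text encoder such as CLIP's; here they are small hand-written vectors.

```python
import math


def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def greedy_filter(items, similarity, threshold):
    """Return indices of kept items (assumed rule, not from the paper).

    An item is kept only if its similarity to every previously kept
    item is at most `threshold`; earlier items win ties.
    """
    kept = []
    for i, item in enumerate(items):
        if all(similarity(item, items[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def refine_views(boxes, iou_threshold=0.5):
    """View refinement sketch: drop patches that overlap a kept patch too much."""
    return greedy_filter(boxes, iou, iou_threshold)


def refine_descriptions(embeddings, sim_threshold=0.9):
    """Description refinement sketch: drop near-duplicate description vectors."""
    return greedy_filter(embeddings, cosine, sim_threshold)


# Hypothetical inputs: three crops and three description embeddings.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]

kept_views = refine_views(boxes)
kept_descriptions = refine_descriptions(embeddings)
```

With these inputs, the second crop overlaps the first with an IoU of 81/119 (about 0.68) and the second embedding is nearly parallel to the first, so both calls keep only the items at indices 0 and 2. A greedy pass is order-dependent; other selection rules (for example, clustering) would fit the same description in the abstract.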

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis