TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Bingyi Cao; Koert Chen; Kevis-Kokitsi Maninis; Kaifeng Chen; Arjun Karpur; Ye Xia; Sahil Dua; Tanmaya Dabral; Guangxing Han; Bohyung Han; Joshua Ainslie; Alex Bewley; Mithun Jacob; René Wagner; Washington Ramos; Krzysztof Choromanski; Mojtaba Seyedhosseini; Howard Zhou; André Araujo

arXiv:2604.12012·cs.CV·April 15, 2026

TL;DR

TIPSv2 introduces enhanced patch-text alignment techniques in vision-language models, significantly improving dense correspondence and downstream task performance through novel training strategies and model upgrades.

Contribution

The paper proposes novel methods, including patch-level distillation and an upgraded pretraining objective, to substantially improve dense patch-text alignment in vision-language models.

Findings

1. Patch-level distillation surpasses teacher models in patch-text alignment.

2. iBOT++ with unmasked token loss enhances pretraining effectiveness.

3. TIPSv2 achieves state-of-the-art results on multiple vision-language tasks.
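
This page does not spell out the iBOT++ objective. As a rough illustration only, the sketch below assumes that "unmasked token loss" means the patch-level self-distillation cross-entropy is averaged over all patches rather than over masked patches only, as in the original iBOT. The function names, the `include_unmasked` flag, and the toy distributions are hypothetical and not taken from the paper.

```python
import math


def cross_entropy(teacher_probs, student_probs):
    """Cross-entropy H(teacher, student) for one patch.

    Both arguments are probability vectors; student entries must be > 0.
    """
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def patch_loss(teacher, student, mask, include_unmasked):
    """Average per-patch cross-entropy over the selected patches.

    include_unmasked=False: masked patches only (iBOT-style).
    include_unmasked=True:  every patch (ASSUMED reading of iBOT++).
    """
    selected = [i for i, masked in enumerate(mask) if masked or include_unmasked]
    if not selected:
        raise ValueError("no patches selected for the loss")
    total = sum(cross_entropy(teacher[i], student[i]) for i in selected)
    return total / len(selected)


# Toy example: two patches, two prototype classes, first patch masked.
teacher = [[1.0, 0.0], [0.0, 1.0]]
student = [[0.5, 0.5], [0.25, 0.75]]
mask = [True, False]

masked_only = patch_loss(teacher, student, mask, include_unmasked=False)
all_patches = patch_loss(teacher, student, mask, include_unmasked=True)
```

In this toy case the second patch contributes to the loss only when `include_unmasked=True`, which is the one difference the sketch is meant to show.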

Abstract

Recent progress in vision-language pretraining has enabled significant improvements to many downstream computer vision applications, such as classification, retrieval, segmentation and depth prediction. However, a fundamental capability that these models still struggle with is aligning dense patch representations with text embeddings of corresponding concepts. In this work, we investigate this critical issue and propose novel techniques to enhance this capability in foundational vision-language models. First, we reveal that a patch-level distillation procedure significantly boosts dense patch-text alignment -- surprisingly, the patch-text alignment of the distilled student model strongly surpasses that of the teacher model. This observation inspires us to consider modifications to pretraining recipes, leading us to propose iBOT++, an upgrade to the commonly-used iBOT masked image…
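
To make "aligning dense patch representations with text embeddings" concrete, here is a minimal sketch, assuming that alignment is scored with cosine similarity and that each patch is labeled with its best-matching text prompt. The helper names and the 2-D toy embeddings are made up for illustration; they are not model outputs or the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_patches(patch_embeddings, text_embeddings):
    """Assign each patch the index of its most similar text embedding."""
    labels = []
    for patch in patch_embeddings:
        scores = [cosine(patch, text) for text in text_embeddings]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy 2-D embeddings (hypothetical values, one row per image patch).
patches = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
# One embedding per text prompt, e.g. "sky" and "grass".
texts = [[1.0, 0.0], [0.0, 1.0]]

labels = label_patches(patches, texts)
```

Reshaping `labels` back to the patch grid would give a coarse zero-shot segmentation map; the better the patch-text alignment, the more these per-patch labels agree with the true image regions.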

Code & Models

Repositories

https://gdm-tipsv2.github.io