DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Sophia Sirko-Galouchenko; Spyros Gidaris; Antonin Vobecky; Andrei Bursuc; Nicolas Thome

arXiv:2506.18463·cs.CV·September 10, 2025

TL;DR

DIP is an unsupervised post-training method that improves the dense representations of pretrained vision encoders by training them on simulated in-context tasks. It improves downstream in-context scene understanding and requires less than 9 hours of training on a single A100 GPU.

Contribution

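DIP introduces a simple, unsupervised post-training approach built on pseudo in-context tasks. The tasks are generated automatically from unlabeled data by combining a pretrained diffusion model with the vision encoder itself, which avoids the complex self-distillation architectures used by prior approaches.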

Findings

01. Outperforms the initial vision encoders on downstream tasks
02. Requires less than 9 hours of training on a single A100 GPU
03. Effective across a variety of real-world in-context scene understanding tasks

Abstract

We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It…
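To make the idea of a dense in-context task concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each query patch is labeled by softmax-weighted attention over labeled support patches, and that a cross-entropy loss against (pseudo) labels would drive post-training. The function names, the temperature value, the loss, and the toy features are all hypothetical; in DIP the features would come from the vision encoder and the pseudo labels from the automatic task-generation mechanism described in the abstract.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def in_context_predict(query_feats, support_feats, support_labels,
                       num_classes, temperature=0.1):
    """Return one class distribution per query patch.

    Each query patch attends over the support patches using cosine
    similarity, and the attention weights are accumulated per support label.
    The temperature is an assumed value, not one taken from the paper.
    """
    support_unit = [normalize(s) for s in support_feats]
    predictions = []
    for q in query_feats:
        q_unit = normalize(q)
        sims = [sum(a * b for a, b in zip(q_unit, s)) / temperature
                for s in support_unit]
        weights = softmax(sims)
        probs = [0.0] * num_classes
        for w, label in zip(weights, support_labels):
            probs[label] += w
        predictions.append(probs)
    return predictions


def cross_entropy(predictions, targets, eps=1e-12):
    """Mean negative log-likelihood of the target class per query patch."""
    total = sum(math.log(p[t] + eps) for p, t in zip(predictions, targets))
    return -total / len(targets)


# Toy task: three labeled support patches, two query patches (2-D features).
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_labels = [0, 0, 1]
query_feats = [[0.8, 0.2], [0.1, 0.9]]
query_labels = [0, 1]  # stand-ins for automatically generated pseudo labels

probs = in_context_predict(query_feats, support_feats, support_labels,
                           num_classes=2)
predicted = [max(range(2), key=p.__getitem__) for p in probs]
loss = cross_entropy(probs, query_labels)
```

In a real setup the features would be tensors and the loss would be backpropagated into the encoder; plain Python lists are used here only to keep the sketch dependency-free.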

Code & Models

Repositories

sirkosophia/dip (PyTorch, official)

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis

Methods: Diffusion