ATAS: Any-to-Any Self-Distillation for Enhanced Open-Vocabulary Dense Prediction

Juan Yeo; Soonwoo Cha; Jiwoo Song; Hyunbin Jin; Taesup Kim

arXiv:2506.08678 · cs.CV · October 2, 2025

TL;DR

ATAS introduces a self-distillation method that improves CLIP's fine-grained, region-level understanding in open-vocabulary dense prediction tasks by enhancing semantic coherence and alignment without extra supervision.

Contribution

The paper proposes ATAS, a novel self-distillation approach that refines CLIP's representations for better dense prediction performance using only unlabeled images.

Findings

01. Significant performance improvements on object detection benchmarks.

02. Outperforms baseline CLIP models in semantic segmentation.

03. Effectively maintains semantic coherence while sharpening local details.

Abstract

Vision-language models such as CLIP have recently propelled open-vocabulary dense prediction tasks by enabling recognition of a broad range of visual concepts. However, CLIP still struggles with fine-grained, region-level understanding, which hinders its effectiveness on these dense prediction tasks. We identify two pivotal factors required to address this limitation: semantic coherence and fine-grained vision-language alignment. Current adaptation methods often improve fine-grained alignment at the expense of semantic coherence, and often rely on extra modules or supervised fine-tuning. To overcome these issues, we propose Any-to-Any Self-Distillation (ATAS), a novel approach that simultaneously enhances semantic coherence and fine-grained alignment by leveraging a model's own knowledge across all representation levels. Unlike prior methods, ATAS uses only unlabeled images and an internal…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Contrastive Language-Image Pre-training