Kernel-based Unsupervised Embedding Alignment for Enhanced Visual Representation in Vision-language Models
Shizhan Gong, Yankai Jiang, Qi Dou, Farzan Farnia

TL;DR
This paper introduces a kernel-based alignment method to improve CLIP's visual representations by leveraging DINOv2, resulting in enhanced fine-grained perception and downstream multi-modal model performance.
Contribution
A novel kernel-based alignment technique that enhances CLIP's visual embeddings with DINOv2's fine-grained perception capabilities without losing text compatibility.
Findings
Improved zero-shot object recognition accuracy.
Enhanced fine-grained spatial reasoning.
Better localization in downstream tasks.
Abstract
Vision-language models, such as CLIP, have achieved significant success in aligning visual and textual representations, becoming essential components of many multi-modal large language models (MLLMs) like LLaVA and OpenFlamingo. However, numerous studies have identified CLIP's limited fine-grained perception as a critical drawback, leading to substantial failures in downstream MLLMs. In contrast, vision-centric foundation models like DINOv2 demonstrate remarkable capabilities in capturing fine details from images. In this work, we propose a novel kernel-based method to align CLIP's visual representation with that of DINOv2, ensuring that the resulting embeddings maintain compatibility with text embeddings while enhancing perceptual capabilities. Our alignment objective is designed for efficient stochastic optimization. Following this image-only alignment fine-tuning, the visual encoder…
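The abstract does not spell out the alignment objective, so the following is only a rough illustration of what a kernel-based alignment between two image encoders can look like. It is not the authors' loss: it swaps in centered kernel alignment over RBF Gram matrices as a stand-in, computed on a single mini-batch. The function names (`rbf_kernel`, `center`, `kernel_alignment`, `alignment_loss`), the bandwidth choice, and the use of plain NumPy arrays in place of CLIP and DINOv2 embeddings are all assumptions made for the example.

```python
import numpy as np


def rbf_kernel(x, gamma=None):
    """Gram matrix of an RBF kernel over the rows of x (shape: batch x dim)."""
    sq = np.sum(x ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped to guard against
    # tiny negative values from floating-point error.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * x @ x.T, 0.0)
    if gamma is None:
        # Assumed bandwidth: 1 / embedding dimension (not from the paper).
        gamma = 1.0 / x.shape[1]
    return np.exp(-gamma * d2)


def center(k):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 11^T."""
    n = k.shape[0]
    h = np.eye(n) - np.ones((n, n)) / n
    return h @ k @ h


def kernel_alignment(k_a, k_b):
    """Centered kernel alignment between two Gram matrices (value in [0, 1])."""
    ka, kb = center(k_a), center(k_b)
    return np.sum(ka * kb) / (np.linalg.norm(ka) * np.linalg.norm(kb))


def alignment_loss(student_emb, teacher_emb):
    """Hypothetical mini-batch loss: 1 - alignment of the two Gram matrices.

    student_emb stands in for the encoder being fine-tuned (e.g. CLIP image
    embeddings) and teacher_emb for the reference encoder (e.g. DINOv2).
    """
    return 1.0 - kernel_alignment(rbf_kernel(student_emb), rbf_kernel(teacher_emb))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    student = rng.normal(size=(16, 8))    # placeholder for student embeddings
    teacher = rng.normal(size=(16, 12))   # placeholder for teacher embeddings
    print("loss on unrelated embeddings:", alignment_loss(student, teacher))
    print("loss on identical embeddings:", alignment_loss(student, student))
```

Because the loss compares batch-by-batch Gram matrices rather than the embeddings themselves, the two encoders may have different output dimensions, and the loss only needs images, not captions. A loss of this form is also computable per mini-batch, which is one way an alignment objective can be made amenable to stochastic optimization.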
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
