A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Rishabh Kabra; Maks Ovsjanikov; Drew A. Hudson; Ye Xia; Skanda Koppula; Andre Araujo; Joao Carreira; Niloy J. Mitra

arXiv:2602.24181 · cs.CV · March 2, 2026

Open Access

TL;DR

This paper introduces the Omnivorous Vision Encoder, a framework that learns a unified, modality-agnostic feature space for various visual inputs, enhancing cross-modal understanding while maintaining the strengths of pre-trained models like DINOv2.

Contribution

It proposes a novel dual-objective training method to create a vision encoder that produces consistent embeddings across different modalities, improving cross-modal alignment.

Findings

01. Achieves better cross-modal feature alignment than baseline models.

02. Produces consistent scene embeddings across multiple input modalities.

03. Retains the discriminative power of the original foundation model.

Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a…
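The dual objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact loss forms, so the sketch assumes both terms are written as one minus cosine similarity, that the teacher sees only the RGB input, and that the two terms are combined with a hypothetical `distill_weight`. The function names and array shapes are also illustrative assumptions.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_unit * b_unit, axis=-1)


def dual_objective_loss(student_rgb, student_depth, teacher_rgb,
                        distill_weight=1.0):
    """Hypothetical dual objective (assumed form, not taken from the paper).

    align:   pulls student embeddings of two modalities of the same scene
             together (1 - cosine similarity, averaged over the batch).
    distill: keeps the student's RGB embedding close to the frozen
             teacher's RGB embedding (same distance, assumed).
    """
    align = np.mean(1.0 - cosine_similarity(student_rgb, student_depth))
    distill = np.mean(1.0 - cosine_similarity(student_rgb, teacher_rgb))
    total = align + distill_weight * distill
    return total, align, distill


# Toy data standing in for encoder outputs: 4 scenes, 16-dim embeddings.
rng = np.random.default_rng(0)
teacher_rgb = rng.normal(size=(4, 16))

# A perfectly aligned student: both modalities match the teacher.
aligned_total, aligned_align, aligned_distill = dual_objective_loss(
    teacher_rgb, teacher_rgb, teacher_rgb)

# A misaligned student: depth embeddings point the opposite way.
bad_total, bad_align, bad_distill = dual_objective_loss(
    teacher_rgb, -teacher_rgb, teacher_rgb)
```

In this toy setup the aligned student incurs a loss near zero, while the student whose depth embeddings are flipped pays the maximum alignment penalty and no distillation penalty. The same `cosine_similarity` helper could be applied to a frozen encoder's RGB and depth embeddings to reproduce the kind of diagnostic the abstract mentions, although the paper's actual measurement protocol is not specified here.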

Taxonomy

Topics: Advanced Neural Network Applications · Multimodal Machine Learning Applications · Robot Manipulation and Learning