DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning

Mahmut Selman Gokmen; Cody Bumgardner

arXiv:2511.01610 · cs.CV · November 4, 2025

TL;DR

DINO-MX is a flexible, modular framework for self-supervised vision learning that supports diverse architectures and training strategies, achieving competitive results with reduced computational costs and enhanced interpretability.

Contribution

DINO-MX introduces a unified, extensible training system that combines multiple DINO variants and supports a range of architectures, training strategies, and data types, with tools for interpretability and data augmentation.

Findings

01 Achieves competitive performance on diverse datasets.

02 Reduces computational costs significantly.

03 Enhances interpretability and localization without extra detection heads.

Abstract

Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types,…
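As a rough illustration of one training strategy the abstract names, low-rank adaptation (LoRA), the sketch below shows the general technique in plain Python: a frozen weight matrix W is left untouched and only a low-rank update B·A is trained. The names `LoRALinear`, `lora_param_count`, `rank`, and `alpha` are hypothetical and chosen for this example; this is not DINO-MX's actual API or implementation, only a minimal sketch of standard LoRA under those assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of a rank-`rank` update: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)


class LoRALinear:
    """Hypothetical illustration: frozen weight W plus a trainable update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight            # frozen: never updated during adaptation
        self.scale = alpha / rank
        # A starts small and random, B starts at zero, so the layer
        # initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return lora_param_count(len(self.weight[0]), len(self.weight), len(self.A))


if __name__ == "__main__":
    layer = LoRALinear([[1.0, 0.0], [0.0, 1.0]], rank=1)
    print(layer.forward([3.0, 4.0]))
    # Compare a full 768 x 768 weight with a rank-8 update of the same layer.
    print(768 * 768, lora_param_count(768, 768, 8))
```

For a 768 × 768 projection, the rank-8 update trains 8 × (768 + 768) parameters instead of 768 × 768, which is the kind of saving that makes adapter-style training attractive when compute is limited.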

Taxonomy

Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis