DINO-MX: A Modular & Flexible Framework for Self-Supervised Learning
Mahmut Selman Gokmen, Cody Bumgardner

TL;DR
DINO-MX is a flexible, modular framework for self-supervised vision learning that supports diverse architectures and training strategies, achieving competitive results with reduced computational costs and enhanced interpretability.
Contribution
DINO-MX introduces a unified, extensible training system that combines multiple DINO variants and supports various architectures, training strategies, and data types, with tools for interpretability and data augmentation.
Findings
Achieves competitive performance on diverse datasets.
Reduces computational costs significantly.
Enhances interpretability and localization without extra detection heads.
Abstract
Vision Foundation Models (VFMs) have advanced representation learning through self-supervised methods. However, existing training pipelines are often inflexible, domain-specific, or computationally expensive, which limits their usability across different domains and resource settings. DINO-MX is a modular and extensible training framework that combines the core principles of DINO, DINOv2 and DINOv3 within a unified configuration-driven system. It supports a variety of transformer-based architectures and is fully compatible with the Hugging Face ecosystem. The framework includes multiple training strategies such as low-rank adaptation (LoRA), layer freezing, and knowledge distillation, along with support for distributed training through both Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP). DINO-MX is designed to work with both natural and specialized data types,…
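The DINO family that the abstract refers to is built on self-distillation: a student network is trained to match a centered, sharpened teacher distribution, and the teacher is an exponential moving average (EMA) of the student. The following is a minimal pure-Python sketch of that idea and does not use the DINO-MX API. The function names (`dino_loss`, `ema_update`, `update_center`) are hypothetical, and the temperature and momentum values are illustrative assumptions rather than DINO-MX defaults.

```python
import math


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of logits divided by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_total = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_total for s in scaled]


def dino_loss(student_logits, teacher_logits, center,
              student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between a centered, sharpened teacher and the student.

    Temperatures are illustrative assumptions, not framework defaults.
    """
    # Center the teacher logits, then sharpen with a low temperature.
    centered = [t - c for t, c in zip(teacher_logits, center)]
    teacher_probs = [math.exp(v) for v in log_softmax(centered, teacher_temp)]
    student_log_probs = log_softmax(student_logits, student_temp)
    return -sum(p * lq for p, lq in zip(teacher_probs, student_log_probs))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter toward the matching student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]


def update_center(center, teacher_logits_batch, momentum=0.9):
    """Update the running center with the batch mean of teacher logits."""
    n = len(teacher_logits_batch)
    batch_mean = [sum(col) / n for col in zip(*teacher_logits_batch)]
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]


if __name__ == "__main__":
    center = [0.0, 0.0, 0.0]
    teacher = [2.0, 0.0, 0.0]
    # A student that agrees with the teacher incurs a lower loss.
    print(dino_loss([2.0, 0.0, 0.0], teacher, center))
    print(dino_loss([0.0, 2.0, 0.0], teacher, center))
```

In a real training loop, gradients flow only through the student, while the teacher and the center are updated by the two moving-average rules shown above.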
Taxonomy
Topics: Face recognition and analysis · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
