TL;DR
Franca is a fully open-source (data, code, weights) vision foundation model that matches, and in many cases surpasses, proprietary models such as DINOv2, CLIP, and SigLIPv2. It combines a nested Matryoshka clustering projector with positional disentanglement.
Contribution
Franca introduces a parameter-efficient multi-head clustering projector built on nested Matryoshka representations, together with a positional disentanglement strategy.
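The nested multi-head idea named above can be illustrated with a small sketch. This is not Franca's released code: the prefix sizes, prototype counts, random prototypes, temperature, and every function name below are hypothetical choices made only for illustration. Each head sees a nested prefix of the same feature vector and produces a soft assignment over its own set of prototypes.

```python
import math
import random

def l2_normalize(vec):
    # Scale a vector to unit length (guard against a zero norm).
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def softmax(scores, temperature):
    # Numerically stable softmax over temperature-scaled scores.
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def make_heads(prefix_dims, num_prototypes, seed=0):
    # One head per nested prefix; prototypes are random here (assumption),
    # whereas a trained model would learn them.
    rng = random.Random(seed)
    heads = []
    for d, k in zip(prefix_dims, num_prototypes):
        protos = [l2_normalize([rng.gauss(0, 1) for _ in range(d)])
                  for _ in range(k)]
        heads.append((d, protos))
    return heads

def nested_assignments(feature, heads, temperature=0.1):
    # For each head, take the first d dimensions of the feature and compute
    # a soft cluster assignment from cosine similarity to that head's prototypes.
    out = []
    for d, protos in heads:
        prefix = l2_normalize(feature[:d])
        sims = [sum(a * b for a, b in zip(prefix, p)) for p in protos]
        out.append(softmax(sims, temperature))
    return out

# Hypothetical configuration: 16-dim feature, prefixes of 4, 8, and 16 dims,
# with more prototypes for the larger prefixes.
heads = make_heads(prefix_dims=[4, 8, 16], num_prototypes=[2, 4, 8])
rng = random.Random(1)
feature = [rng.gauss(0, 1) for _ in range(16)]
assignments = nested_assignments(feature, heads)
```

Because every head reads a prefix of one shared vector, the heads add prototypes but no extra backbone features, which is the sense in which such a design can be parameter-efficient.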
Findings
Matches or surpasses state-of-the-art proprietary models
Improves downstream benchmark performance with cleaner features
Offers a scalable, memory-efficient clustering approach
Abstract
We present Franca (pronounced Fran-ka, meaning "free one"), the first fully open-source (data, code, weights) vision foundation model that matches and in many cases surpasses the performance of state-of-the-art proprietary models such as DINOv2, CLIP, and SigLIPv2. Our approach is grounded in a transparent training pipeline inspired by Web-SSL and uses publicly available data: ImageNet-21K and a subset of ReLAION-2B. Beyond the model release, we tackle critical limitations in SSL clustering methods. While modern models rely on assigning image features to large codebooks via clustering algorithms like Sinkhorn-Knopp, they fail to account for the inherent ambiguity in clustering semantics. To address this, we introduce a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations. This design progressively refines features into increasingly fine-grained clusters…
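For readers unfamiliar with the Sinkhorn-Knopp step mentioned in the abstract, here is a generic sketch of how it turns a batch of feature-to-prototype similarity scores into balanced soft assignments. It follows the commonly used alternating row and column normalization, not Franca's own implementation; the function name, `epsilon`, `n_iters`, and the toy scores are assumptions.

```python
import math

def sinkhorn_knopp(scores, epsilon=0.05, n_iters=3):
    """scores: B rows (samples), each holding K prototype similarities.
    Returns B x K soft assignments whose rows each sum to 1, with mass
    pushed toward an even split across the K prototypes."""
    B, K = len(scores), len(scores[0])
    top = max(max(row) for row in scores)  # subtract max for stability
    Q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    total = sum(sum(row) for row in Q)
    Q = [[q / total for q in row] for row in Q]
    for _ in range(n_iters):
        # Rescale each prototype column to carry 1/K of the total mass.
        col_sums = [sum(Q[i][j] for i in range(B)) for j in range(K)]
        Q = [[Q[i][j] / (col_sums[j] * K) for j in range(K)] for i in range(B)]
        # Rescale each sample row to carry 1/B of the total mass.
        row_sums = [sum(row) for row in Q]
        Q = [[q / (row_sums[i] * B) for q in row] for i, row in enumerate(Q)]
    # Multiply by B so every sample's assignment sums to 1.
    return [[q * B for q in row] for row in Q]

# Toy batch (assumption): all four samples score prototype 0 highest.
scores = [[0.9, 0.1], [0.8, 0.3], [0.7, 0.2], [0.6, 0.5]]
assignments = sinkhorn_knopp(scores)
prototype_1_mass = sum(row[1] for row in assignments)
```

In this toy batch a plain per-sample softmax would send almost all mass to prototype 0; the column rescaling shifts part of it to prototype 1, which is how the procedure discourages collapse onto a few clusters.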
