TL;DR
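SigLino is an efficient multi-teacher distillation method for vision models that distills SigLIP2 and DINOv3 into Dense and Mixture-of-Experts students. It combines an Asymmetric Relation-Knowledge Distillation loss, token-balanced batching, hierarchical data sampling, and a large curated dataset to improve transferability and reduce training cost.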
Contribution
The paper presents SigLino, an agglomerative vision foundation model framework that improves multi-teacher distillation through an Asymmetric Relation-Knowledge Distillation loss, token-balanced batching, and hierarchical data sampling, and releases a large, efficient training dataset.
Findings
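An Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer.
Token-balanced batching, which packs varying-resolution images into sequences with uniform token budgets, stabilizes representation learning across resolutions.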
Hierarchical data sampling improves sample efficiency over random sampling.
Abstract
Vision foundation models trained via multi-teacher distillation offer a promising path toward unified visual representations, yet the learning dynamics and data efficiency of such approaches remain underexplored. In this paper, we systematically study multi-teacher distillation for vision foundation models and identify key factors that enable training at lower computational cost. We introduce SigLino, an efficient family of agglomerative vision foundation models that distill knowledge from SigLIP2 and DINOv3 simultaneously into Dense and Mixture-of-Experts students. We show that (1) our Asymmetric Relation-Knowledge Distillation loss preserves the geometric properties of each teacher while enabling effective knowledge transfer, (2) token-balanced batching that packs varying-resolution images into sequences with uniform token budgets stabilizes representation learning across resolutions…
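The token-balanced batching described in point (2) can be illustrated with a minimal sketch. This is not the authors' implementation: the patch size of 16, the greedy first-fit-decreasing strategy, and the helper names `num_tokens` and `pack_by_token_budget` are all assumptions chosen for illustration. It shows only the general idea of grouping images of different resolutions into sequences that each stay within a fixed token budget.

```python
from math import ceil
from typing import List, Tuple


def num_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image (patch size is an assumption)."""
    return ceil(height / patch) * ceil(width / patch)


def pack_by_token_budget(
    sizes: List[Tuple[int, int]], budget: int, patch: int = 16
) -> List[List[int]]:
    """Greedily pack image indices into sequences of at most `budget` tokens.

    Hypothetical helper: first-fit-decreasing bin packing, not necessarily
    the packing rule used by SigLino.
    """
    # Visit the largest images first so they claim space before small ones.
    order = sorted(
        range(len(sizes)),
        key=lambda i: num_tokens(*sizes[i], patch),
        reverse=True,
    )
    packs: List[List[int]] = []  # image indices per packed sequence
    loads: List[int] = []        # tokens already used per packed sequence
    for i in order:
        t = num_tokens(*sizes[i], patch)
        if t > budget:
            raise ValueError(f"image {i} needs {t} tokens, budget is {budget}")
        for b, load in enumerate(loads):
            if load + t <= budget:
                packs[b].append(i)
                loads[b] += t
                break
        else:
            # No existing sequence has room, so start a new one.
            packs.append([i])
            loads.append(t)
    return packs


sizes = [(224, 224), (512, 512), (256, 256), (384, 384)]
packs = pack_by_token_budget(sizes, budget=1024)
```

With these example sizes the images need 196, 1024, 256, and 576 tokens respectively, so no packed sequence exceeds the 1024-token budget regardless of the mix of resolutions it contains.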
