TL;DR
ML-Embed is a suite of inclusive, efficient multilingual embedding models built on a new framework, 3-Dimensional Matryoshka Learning (3D-ML). It targets three barriers in text embeddings: high computational cost, narrow language coverage, and limited transparency.
Contribution
The paper presents ML-Embed and the 3D-ML framework, combining efficiency, multilingual coverage, and transparency, with models and data openly released for reproducibility.
Findings
Models set new records on 9 of 17 MTEB benchmarks.
Strong performance in low-resource languages.
Efficient across the entire model lifecycle.
Abstract
The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate…
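The storage benefit the abstract attributes to Matryoshka Representation Learning comes from a general property of MRL-trained embeddings: a prefix of the vector can be used as a smaller embedding. The sketch below illustrates only that general idea and is not the authors' code. The function names, the 768 and 64 dimensions, and the random vectors standing in for model outputs are all assumptions made for illustration.

```python
import math
import random


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and L2-normalize the result.

    Hypothetical helper: MRL-style models are trained so that such
    prefixes remain useful, which is what makes truncation meaningful.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        return list(prefix)  # nothing to normalize for an all-zero prefix
    return [x / norm for x in prefix]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


# Random vectors stand in for real model outputs (assumption).
random.seed(0)
FULL_DIM = 768   # assumed full embedding size
SMALL_DIM = 64   # assumed truncated size

doc_a = [random.gauss(0.0, 1.0) for _ in range(FULL_DIM)]
doc_b = [random.gauss(0.0, 1.0) for _ in range(FULL_DIM)]

small_a = truncate_embedding(doc_a, SMALL_DIM)
small_b = truncate_embedding(doc_b, SMALL_DIM)
score = dot(small_a, small_b)  # similarity computed on 64 dims instead of 768
```

With these assumed sizes, each stored vector shrinks by a factor of 12. How much retrieval quality survives truncation depends on the trained model and is not something this sketch can show.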
