F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World
Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

TL;DR
F2LLM-v2 is a family of multilingual embedding models in 8 sizes from 80M to 14B that supports more than 200 languages. The 14B model ranks first on 11 MTEB benchmarks, the smaller models target resource-constrained applications, and the models are available as open source.
Contribution
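The paper presents a scalable, efficient family of multilingual embedding models. It combines a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, and reports extensive evaluation on MTEB benchmarks, advancing open-source multilingual NLP resources.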
Findings
F2LLM-v2-14B ranks first on 11 MTEB benchmarks.
The smaller models in the family set a new state of the art for resource-constrained applications.
Models support over 200 languages, including underserved ones.
Abstract
We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models,…
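The abstract names matryoshka learning as one of the efficiency techniques. Embeddings trained this way are generally meant to remain useful when only their leading coordinates are kept, which trades some accuracy for smaller storage and faster search. The sketch below illustrates that truncation step only. It uses random vectors as stand-ins for model outputs, and the widths 64 and 16 and the helper name `truncate_embeddings` are illustrative assumptions; the dimensions F2LLM-v2 actually supports are not stated in this listing.

```python
import numpy as np


def truncate_embeddings(embeddings: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` coordinates of each row, then L2-normalize.

    Hypothetical helper for illustration; not part of any F2LLM-v2 release.
    """
    if not 0 < dim <= embeddings.shape[1]:
        raise ValueError("dim must be between 1 and the embedding width")
    head = embeddings[:, :dim]
    # Re-normalize so dot products between rows are cosine similarities again.
    norms = np.linalg.norm(head, axis=1, keepdims=True)
    return head / np.clip(norms, 1e-12, None)


# Random stand-ins for embeddings produced by a model (assumed width: 64).
rng = np.random.default_rng(0)
full = rng.normal(size=(4, 64))

small = truncate_embeddings(full, 16)  # assumed target width: 16
similarity = small @ small.T           # pairwise cosine similarities
```

Re-normalizing after truncation matters because a prefix of a unit vector is usually no longer unit length, so skipping it would distort cosine scores.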
Code & Models
- 🤗 codefuse-ai/F2LLM-v2-14B · model · 489 dl · ♡ 5
- 🤗 codefuse-ai/F2LLM-v2-80M · model · 408 dl · ♡ 5
- 🤗 codefuse-ai/F2LLM-v2-1.7B · model · 1.5k dl · ♡ 4
- 🤗 codefuse-ai/F2LLM-v2-8B-Preview · model · 36 dl · ♡ 3
- 🤗 codefuse-ai/F2LLM-v2-1.7B-Preview · model · 220 dl · ♡ 2
- 🤗 codefuse-ai/F2LLM-v2-0.6B-Preview · model · 128 dl · ♡ 3
- 🤗 codefuse-ai/F2LLM-v2-4B-Preview · model · 67 dl · ♡ 2
- 🤗 codefuse-ai/F2LLM-v2-14B-Preview · model · 34 dl · ♡ 2
- 🤗 codefuse-ai/F2LLM-v2-0.6B · model · 1.0k dl · ♡ 2
- 🤗 codefuse-ai/F2LLM-v2-160M · model · 398 dl · ♡ 2
Taxonomy
Topics: Advanced Graph Neural Networks · Natural Language Processing Techniques · Topic Modeling
