C-Pack: Packed Resources For General Chinese Embeddings
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, Jian-Yun Nie

TL;DR
C-Pack introduces a benchmark (C-MTEB), a training dataset (C-MTP), and a family of models (C-TEM) for general Chinese text embeddings. The models achieved state-of-the-art results on C-MTEB at the time of release, and the full package is released for future research.
Contribution
The paper presents a new suite of resources (C-MTEB, C-MTP, and C-TEM) that advances Chinese text embeddings and outperforms previous models. It also releases data and models for English text embeddings.
Findings
C-TEM models outperform all prior Chinese text embeddings on C-MTEB by up to 10% at the time of release.
The English models achieve state-of-the-art performance on MTEB.
The released English data is twice as large as the Chinese data.
Abstract
We introduce C-Pack, a package of resources that significantly advance the field of general Chinese embeddings. C-Pack includes three critical resources. 1) C-MTEB is a comprehensive benchmark for Chinese text embeddings covering 6 tasks and 35 datasets. 2) C-MTP is a massive text embedding dataset curated from labeled and unlabeled Chinese corpora for training embedding models. 3) C-TEM is a family of embedding models covering multiple sizes. Our models outperform all prior Chinese text embeddings on C-MTEB by up to +10% at the time of release. We also integrate and optimize the entire suite of training methods for C-TEM. Along with our resources on general Chinese embedding, we release our data and models for English text embeddings. The English models achieve state-of-the-art performance on the MTEB benchmark; meanwhile, our released English data is 2 times larger than the Chinese…
Code & Models
- 🤗 BAAI/bge-small-en-v1.5 · model · 12.0M dl · ♡ 430
- 🤗 BAAI/bge-large-en-v1.5 · model · 6.6M dl · ♡ 638
- 🤗 BAAI/bge-base-en-v1.5 · model · 5.5M dl · ♡ 408
- 🤗 BAAI/bge-small-en · model · 373k dl · ♡ 90
- 🤗 BAAI/bge-reranker-base · model · 2.3M dl · ♡ 227
- 🤗 BAAI/bge-large-zh-v1.5 · model · 610k dl · ♡ 616
- 🤗 BAAI/bge-small-zh-v1.5 · model · 1.5M dl · ♡ 104
- 🤗 BAAI/bge-reranker-large · model · 790k dl · ♡ 454
- 🤗 BAAI/bge-large-en · model · 219k dl · ♡ 224
- 🤗 BAAI/bge-large-zh · model · 14k dl · ♡ 345
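The embedding models listed above map text to vectors, which are typically compared by cosine similarity for retrieval. The following is a minimal sketch of that comparison step, not the authors' evaluation code. The helper names (`cosine_similarity`, `rank_passages`, `embed_with_bge`) and the toy vectors are hypothetical. `embed_with_bge` assumes the third-party `sentence-transformers` package is installed and that the model weights can be downloaded; it is defined only for illustration and is not called by the demo.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)


def rank_passages(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> List[int]:
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def embed_with_bge(
    texts: List[str], model_name: str = "BAAI/bge-small-en-v1.5"
) -> List[List[float]]:
    """Embed texts with a listed model (assumption: sentence-transformers is
    installed and the weights are reachable; not exercised in the demo)."""
    from sentence_transformers import SentenceTransformer  # third-party

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True returns unit-length vectors.
    return model.encode(texts, normalize_embeddings=True).tolist()


if __name__ == "__main__":
    # Toy 2-d vectors standing in for real embeddings (hypothetical values).
    query = [1.0, 0.0]
    passages = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
    for idx in rank_passages(query, passages):
        print(idx, round(cosine_similarity(query, passages[idx]), 3))
```

With real embeddings, the output of `embed_with_bge` for a query and a list of passages could be passed to `rank_passages` in place of the toy vectors.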
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Hate Speech and Cyberbullying Detection
