Ruri: Japanese General Text Embeddings
Hayato Tsukagoshi, Ryohei Sasano

TL;DR
Ruri is a new Japanese general text embedding model developed using synthesized datasets and knowledge distillation, addressing the lack of Japanese-specific models and datasets in the field.
Contribution
This paper introduces Ruri, a series of Japanese general text embedding models trained on LLM-synthesized datasets, together with a reranker that is used for both dataset filtering and knowledge distillation.
Findings
Ruri achieves competitive performance on Japanese NLP tasks.
Synthesized datasets effectively compensate for the lack of real datasets.
The development process improves Japanese text embedding quality.
Abstract
We report the development of Ruri, a series of Japanese general text embedding models. While the development of general-purpose text embedding models in English and multilingual contexts has been active in recent years, model development in Japanese remains insufficient. The primary reasons for this are the lack of datasets and the absence of necessary expertise. In this report, we provide a detailed account of the development process of Ruri. Specifically, we discuss the training of embedding models using synthesized datasets generated by LLMs, the construction of the reranker for dataset filtering and knowledge distillation, and the performance evaluation of the resulting general-purpose text embedding models.
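The abstract names two uses of the reranker: filtering the synthesized training data and acting as a teacher for knowledge distillation. The sketch below illustrates one common way to do both, and it is not the authors' implementation. The threshold value, the temperature, the KL-divergence objective, and every function name (`filter_by_teacher`, `distillation_loss`, and so on) are assumptions made for illustration, since this summary does not give the actual recipe.

```python
import math

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def filter_by_teacher(pairs, teacher_scores, threshold):
    """Keep only the (query, passage) pairs the teacher reranker scores
    at or above `threshold` (hypothetical cutoff, chosen by the user)."""
    return [p for p, s in zip(pairs, teacher_scores) if s >= threshold]

def distillation_loss(query_vec, passage_vecs, teacher_scores, temperature=1.0):
    """KL(teacher || student) over the candidate passages of one query.

    The student distribution comes from cosine similarities between the
    query embedding and each passage embedding; the teacher distribution
    comes from the reranker's scores for the same candidates.
    """
    student_scores = [cosine(query_vec, p) for p in passage_vecs]
    p_teacher = softmax(teacher_scores, temperature)
    p_student = softmax(student_scores, temperature)
    return sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)

if __name__ == "__main__":
    pairs = [("q1", "relevant passage"), ("q1", "noisy passage")]
    kept = filter_by_teacher(pairs, teacher_scores=[0.92, 0.10], threshold=0.5)
    print(kept)

    query = [1.0, 0.0]
    passages = [[1.0, 0.0], [0.0, 1.0]]
    print(distillation_loss(query, passages, teacher_scores=[0.0, 1.0]))
```

The loss is zero when the student's similarity ranking already matches the teacher's distribution and grows as the two diverge, which is the signal an embedding model would be trained to minimize under this assumed setup.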
Code & Models
- 🤗 cl-nagoya/ruri-v3-310m (model) · 413k dl · ♡ 68
- 🤗 cl-nagoya/ruri-pt-base (model) · 346 dl · ♡ 3
- 🤗 cl-nagoya/ruri-pt-small (model) · 2 dl · ♡ 1
- 🤗 cl-nagoya/ruri-pt-large (model) · 4 dl · ♡ 2
- 🤗 cl-nagoya/ruri-reranker-stage1-small (model) · 4 dl
- 🤗 cl-nagoya/ruri-reranker-small (model) · 2.8k dl · ♡ 2
- 🤗 cl-nagoya/ruri-reranker-stage1-base (model) · 5 dl
- 🤗 cl-nagoya/ruri-reranker-stage1-large (model) · 5 dl · ♡ 1
- 🤗 cl-nagoya/ruri-reranker-base (model) · 128 dl · ♡ 4
- 🤗 cl-nagoya/ruri-reranker-large (model) · 991 dl · ♡ 12
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
