Japanese SimCSE Technical Report
Hayato Tsukagoshi, Ryohei Sasano, Koichi Takeda

TL;DR
This paper introduces Japanese SimCSE, a set of sentence embedding models for Japanese, developed through extensive experiments with various models and datasets, providing a new baseline for Japanese sentence embedding research.
Contribution
The paper presents the Japanese SimCSE models together with their detailed training and evaluation procedures, addressing the lack of baseline sentence embedding models for Japanese.
Findings
Japanese SimCSE models achieve competitive performance.
The experiments cover 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets.
The report provides the detailed training setup and the evaluation results.
Abstract
We report the development of Japanese SimCSE, Japanese sentence embedding models fine-tuned with SimCSE. Since there is a lack of sentence embedding models for Japanese that can be used as a baseline in sentence embedding research, we conducted extensive experiments on Japanese sentence embeddings involving 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets. In this report, we provide the detailed training setup for Japanese SimCSE and their evaluation results.
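The abstract names SimCSE as the fine-tuning method without spelling out the objective. As a rough illustration, the sketch below computes the in-batch contrastive loss used by SimCSE in general: each sentence embedding is pulled toward its paired positive and pushed away from the other positives in the batch. Everything here is an assumption for illustration and not the report's training code: the function names `cosine` and `simcse_loss` are hypothetical, the toy 2-dimensional vectors stand in for real encoder outputs, and the temperature of 0.05 is the default from the original SimCSE paper, which this report may or may not use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(anchors, positives, temperature=0.05):
    """Mean in-batch contrastive loss over paired embeddings.

    anchors[i] and positives[i] embed a positive pair; every other
    positives[j] in the batch acts as a negative for anchors[i].
    The temperature value is an assumed default, not taken from the report.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(x - top) for x in logits))
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical values, not model outputs).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each anchor close to its own positive
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each anchor close to the wrong positive

loss_aligned = simcse_loss(anchors, aligned)
loss_swapped = simcse_loss(anchors, swapped)
```

In this toy setup `loss_aligned` comes out smaller than `loss_swapped`, which is the behavior the objective rewards. In the unsupervised SimCSE setting the positive is the same sentence encoded a second time with a different dropout mask, while the supervised setting draws positives from labeled sentence pairs; which datasets fill that role here is described in the report itself.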
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: SimCSE
