Japanese SimCSE Technical Report

Hayato Tsukagoshi; Ryohei Sasano; Koichi Takeda

arXiv:2310.19349 · cs.CL · October 31, 2023 · 1 citation

TL;DR

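This report introduces Japanese SimCSE, a set of Japanese sentence embedding models fine-tuned with SimCSE. The models were developed through experiments covering 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets, and are intended as a baseline for Japanese sentence embedding research.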

Contribution

The report presents the Japanese SimCSE models together with their detailed training setup and evaluation results, addressing the lack of baseline sentence embedding models for Japanese.

Findings

01

Japanese SimCSE models achieve competitive performance.

02

Experiments cover 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets.

03

The report provides the detailed training setup and evaluation results for the models.

Abstract

We report the development of Japanese SimCSE, Japanese sentence embedding models fine-tuned with SimCSE. Since there is a lack of sentence embedding models for Japanese that can be used as a baseline in sentence embedding research, we conducted extensive experiments on Japanese sentence embeddings involving 24 pre-trained Japanese or multilingual language models, five supervised datasets, and four unsupervised datasets. In this report, we provide the detailed training setup for Japanese SimCSE and their evaluation results.
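
To make "fine-tuned with SimCSE" concrete, the following is a minimal sketch of the contrastive objective that SimCSE is generally described as using: each sentence embedding is pulled toward its positive and pushed away from the other positives in the batch. It is not taken from the report or its repository. The function names `cosine` and `simcse_loss`, the toy two-dimensional vectors, and the temperature of 0.05 are assumptions chosen for illustration, and a real training run would compute embeddings with a pre-trained language model rather than hand-written lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def simcse_loss(anchors, positives, temperature=0.05):
    """In-batch contrastive loss (illustrative, hypothetical helper).

    anchors[i] and positives[i] are embeddings of a positive pair.
    Every positives[j] with j != i acts as a negative for anchors[i].
    The temperature value is an assumption, not a setting from the report.
    """
    losses = []
    for i, h in enumerate(anchors):
        # Scaled similarities between this anchor and every positive in the batch.
        logits = [cosine(h, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching positive as the target class.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two anchors, each with one positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[1.0, 0.0], [0.0, 1.0]]
swapped_positives = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = simcse_loss(anchors, aligned_positives)
swapped_loss = simcse_loss(anchors, swapped_positives)
```

In this toy batch, the loss is close to zero when each anchor matches its own positive and large when the positives are swapped, which is the behavior the objective is meant to reward. In the unsupervised variant the positive is usually the same sentence encoded a second time with a different dropout mask; in the supervised variant it comes from a labeled dataset.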

Code & Models

Repositories

hpprc/simple-simcse-ja (PyTorch, official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: SimCSE