Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases

Sławomir Dadas

arXiv:2207.12759·cs.CL·July 27, 2022

TL;DR

This paper introduces a method to train high-quality language-specific sentence encoders using automatically mined paraphrases from bilingual corpora, avoiding manual labeling and achieving strong performance in less-resourced languages.

Contribution

The authors propose an approach for building sentence encoders for low-resource languages: paraphrase datasets are mined automatically from sentence-aligned bilingual corpora, so no manual annotation is needed.

Findings

01 High performance on Polish sentence tasks

02 Training takes less than a day on a single GPU

03 Outperforms existing multilingual encoders

Abstract

Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. For less popular languages, multilingual models have to be used, and these offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a…
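
The abstract does not spell out how paraphrase pairs are extracted from the bilingual corpus. As a rough illustration only, the sketch below assumes a simple pivot strategy: target-language sentences aligned to the same pivot-language sentence are treated as candidate paraphrases of each other. The function names (`normalize`, `mine_paraphrases`) and the toy English-Polish data are hypothetical and are not taken from the paper or its repository.

```python
from collections import defaultdict
from itertools import combinations


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially identical pivots match."""
    return " ".join(sentence.lower().split())


def mine_paraphrases(aligned_pairs):
    """Group target-language sentences by their pivot-language sentence.

    aligned_pairs: iterable of (pivot_sentence, target_sentence) tuples
    taken from a sentence-aligned bilingual corpus.
    Returns a sorted list of (target_a, target_b) candidate paraphrase pairs.

    Assumption: pivot grouping stands in for the paper's mining step,
    which the abstract does not describe in detail.
    """
    groups = defaultdict(set)
    for pivot, target in aligned_pairs:
        groups[normalize(pivot)].add(target.strip())

    pairs = set()
    for targets in groups.values():
        # Every two distinct targets sharing a pivot form a candidate pair.
        for a, b in combinations(sorted(targets), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Toy English-Polish corpus (hypothetical data for illustration only).
corpus = [
    ("Thank you very much.", "Dziękuję bardzo."),
    ("thank you very much.", "Bardzo dziękuję."),
    ("Good morning.", "Dzień dobry."),
]

paraphrases = mine_paraphrases(corpus)
print(paraphrases)
```

In this toy corpus only the two Polish renderings of "Thank you very much." share a pivot, so they form the single candidate pair. A real pipeline would likely need additional filtering (for example, discarding near-identical pairs or noisy alignments) before the pairs are used for fine-tuning.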

Code & Models

Repositories

sdadas/polish-sentence-evaluation
PyTorch · Official

Datasets

mteb/PpcPC
dataset · 61 dl

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Dropout · Adam · Byte Pair Encoding · Label Smoothing · Layer Normalization