Training Effective Neural Sentence Encoders from Automatically Mined Paraphrases
Sławomir Dadas

TL;DR
This paper introduces a method to train high-quality language-specific sentence encoders using automatically mined paraphrases from bilingual corpora, avoiding manual labeling and achieving strong performance in less-resourced languages.
Contribution
The author proposes an approach for building sentence encoders for low-resource languages: paraphrase datasets are generated automatically from sentence-aligned bilingual texts, removing the need for manual annotation.
Findings
High performance on Polish sentence tasks
Training takes less than a day on a single GPU
Outperforms existing multilingual encoders
Abstract
Sentence embeddings are commonly used in text clustering and semantic retrieval tasks. State-of-the-art sentence representation methods are based on artificial neural networks fine-tuned on large collections of manually labeled sentence pairs. A sufficient amount of annotated data is available for high-resource languages such as English or Chinese. In less popular languages, multilingual models have to be used, which offer lower performance. In this publication, we address this problem by proposing a method for training effective language-specific sentence encoders without manually labeled data. Our approach is to automatically construct a dataset of paraphrase pairs from sentence-aligned bilingual text corpora. We then use the collected data to fine-tune a Transformer language model with an additional recurrent pooling layer. Our sentence encoder can be trained in less than a day on a…
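Illustrative sketch
The abstract says paraphrase pairs are built from sentence-aligned bilingual corpora but does not spell out the procedure here. The Python sketch below shows one plausible way to do it, under a stated assumption: two distinct target-language sentences aligned to the same sentence in the other (pivot) language are treated as paraphrases. The function names, the normalization step, and the toy English-Polish data are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from itertools import combinations


def normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(sentence.lower().split())


def mine_paraphrases(aligned_pairs):
    """Return candidate paraphrase pairs in the target language.

    Assumption (not confirmed by the source): distinct target sentences
    that are aligned to the same pivot sentence are paraphrases.
    """
    by_pivot = defaultdict(set)
    for pivot, target in aligned_pairs:
        by_pivot[normalize(pivot)].add(normalize(target))

    pairs = set()
    for targets in by_pivot.values():
        # Every unordered pair of distinct targets sharing a pivot is a candidate.
        for a, b in combinations(sorted(targets), 2):
            pairs.add((a, b))
    return sorted(pairs)


# Toy English-Polish data, invented for illustration only.
corpus = [
    ("Thank you very much.", "Dziękuję bardzo."),
    ("Thank you very much.", "Bardzo dziękuję."),
    ("thank you very much.", "Dziękuję bardzo."),
    ("Good morning.", "Dzień dobry."),
]

paraphrases = mine_paraphrases(corpus)
```

A real pipeline would likely add filtering (for example, dropping near-identical pairs or noisy alignments) before using the pairs for fine-tuning; those steps are omitted here because the truncated abstract does not describe them.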
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Dropout · Adam · Byte Pair Encoding · Label Smoothing · Layer Normalization
