Contrastive String Representation Learning using Synthetic Data

Urchade Zaratiana

arXiv:2110.04217·cs.CL·December 22, 2021

TL;DR

This paper proposes a contrastive learning approach that trains a string representation learning (SRL) model on synthetic data only, and evaluates the learned embeddings on string similarity matching.

Contribution

It presents a contrastive learning method for string representation learning (SRL), a relatively under-explored area in NLP, that is trained on synthetic data only.

Findings

01

Effective string similarity matching performance

02

Synthetic data suffices for training SRL models

03

Code, data, and pretrained models will be made publicly available

Abstract

String Representation Learning (SRL) is an important task in the field of Natural Language Processing, but it remains under-explored. The goal of SRL is to learn dense, low-dimensional vectors (or embeddings) that encode character sequences. The learned representations can be used in many downstream tasks, such as string similarity matching or lexical normalization. In this paper, we propose a new method to train an SRL model using only synthetic data. Our approach uses Contrastive Learning to maximize the similarity between related strings while minimizing it for unrelated strings. We demonstrate the effectiveness of our approach by evaluating the learned representations on the task of string similarity matching. Code, data, and pretrained models will be made publicly available.
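The contrastive objective described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the paper's implementation: the abstract does not say how the synthetic data is generated, so `perturb` uses random character edits as a hypothetical stand-in; `embed` is a fixed hashed-bigram encoder standing in for a learned model; and `info_nce` is the standard InfoNCE loss with in-batch negatives, which is one common way to "maximize similarity between related strings while minimizing it for unrelated strings".

```python
import math
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def perturb(s, rng, n_edits=1):
    """Hypothetical synthetic-data step: apply random character edits to s.

    A substitution may pick the same character, so the output can equal s.
    """
    chars = list(s)
    for _ in range(n_edits):
        op = rng.choice(["insert", "delete", "substitute"])
        if op == "insert" or not chars:
            chars.insert(rng.randrange(len(chars) + 1), rng.choice(ALPHABET))
        elif op == "delete" and len(chars) > 1:
            del chars[rng.randrange(len(chars))]
        else:
            chars[rng.randrange(len(chars))] = rng.choice(ALPHABET)
    return "".join(chars)


def embed(s, dim=64):
    """Toy, non-learned encoder (assumption): L2-normalized hashed bigram counts."""
    vec = [0.0] * dim
    padded = f"^{s}$"  # boundary markers guarantee at least one bigram
    for a, b in zip(padded, padded[1:]):
        vec[(ord(a) * 31 + ord(b)) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def info_nce(anchors, positives, temperature=0.1):
    """InfoNCE loss: positives[i] is the related string for anchors[i];
    every other positives[j] in the batch acts as an unrelated negative."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    rng = random.Random(0)
    strings = ["london", "paris", "berlin", "madrid"]
    noisy = [perturb(s, rng) for s in strings]
    anchors = [embed(s) for s in strings]
    positives = [embed(s) for s in noisy]
    print(noisy)
    print(info_nce(anchors, positives))
```

In an actual training setup the encoder would have trainable parameters and this loss would be minimized by gradient descent; here the encoder is fixed, so the sketch only shows how related and unrelated pairs enter the objective. The loss is lowest when each anchor is most similar to its own positive and dissimilar to the other strings in the batch.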

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Data Quality and Management

Methods: Contrastive Learning