ConCSE: Unified Contrastive Learning and Augmentation for Code-Switched Embeddings

Jangyeong Jeon, Sangyeon Cho, Minuk Ma, and Junyoung Kim

arXiv:2409.00120 · cs.CL · December 23, 2024

TL;DR

This paper introduces ConCSE, a novel contrastive learning method for code-switched embeddings, validated on a new English-Korean dataset, demonstrating improved semantic similarity performance.

Contribution

It presents a new unified contrastive learning and augmentation approach specifically designed for code-switched language embeddings, along with a new dataset for English-Korean CS scenarios.

Findings

1. ConCSE improves semantic similarity scores by 1.77% on Koglish-STS.
2. The Koglish dataset highlights the need for CS-specific resources.
3. Multilingual models show differential performance on monolingual versus CS data.

Abstract

This paper examines the Code-Switching (CS) phenomenon where two languages intertwine within a single utterance. There exists a noticeable need for research on the CS between English and Korean. We highlight that the current Equivalence Constraint (EC) theory for CS in other languages may only partially capture English-Korean CS complexities due to the intrinsic grammatical differences between the languages. We introduce a novel Koglish dataset tailored for English-Korean CS scenarios to mitigate such challenges. First, we constructed the Koglish-GLUE dataset to demonstrate the importance and need for CS datasets in various tasks. We found the differential outcomes of various foundation multilingual language models when trained on a monolingual versus a CS dataset. Motivated by this, we hypothesized that SimCSE, which has shown strengths in monolingual sentence embedding, would have…

Code & Models

Repositories

jjy961228/ConCSE (PyTorch, official)

Taxonomy

Topics: Speech and dialogue systems

Methods: SimCSE · Contrastive Learning
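
For readers unfamiliar with the contrastive objective that SimCSE is built on, the sketch below computes an in-batch InfoNCE loss over toy sentence embeddings using only the Python standard library. It illustrates the general SimCSE-style loss, not ConCSE's specific objective or augmentation, which this page does not describe. The function names, the toy vectors, and the pairing of each anchor with a same-index positive are assumptions made for illustration; the temperature of 0.05 follows the default commonly used with SimCSE.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean in-batch InfoNCE loss.

    anchors[i] and positives[i] form a positive pair; every other
    positives[j] in the batch serves as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        # -log softmax probability of the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy 2-d "sentence embeddings" (hypothetical values).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each positive is close to the wrong anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each anchor is most similar to its own positive and large when the pairs are mismatched. In a code-switched setting, one plausible pairing (an assumption here, not a claim about the paper) would treat an English sentence and its English-Korean code-switched counterpart as anchor and positive.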