Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint

Dayeon Ki; Cheonbok Park; Hyunjoong Kim

arXiv:2409.15664·cs.CL·September 3, 2025

TL;DR

This paper introduces ORACLE, a novel training method that enforces orthogonality between semantic and language embeddings to reduce semantic leakage and improve cross-lingual sentence representation alignment.

Contribution

The paper proposes a new orthogonality constraint-based training objective, ORACLE, to effectively disentangle semantics and language in cross-lingual embeddings.
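As a rough illustration of what an orthogonality constraint between two embedding sets can look like, the sketch below penalizes the squared cosine similarity between each sentence's semantic vector and its language vector. This is a generic penalty written for illustration only: the function names, the squared-cosine form, and the `eps` parameter are assumptions, not the authors' formulation, which this page does not give.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity of two equal-length vectors (eps guards against zero norms)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def orthogonality_penalty(semantic_batch, language_batch):
    """Mean squared cosine similarity between paired semantic and language vectors.

    Hypothetical helper: the value is 0 when every pair is orthogonal and
    approaches 1 when every pair is parallel, so minimizing it pushes the
    two representations toward orthogonality.
    """
    sims = [cosine(s, l) ** 2 for s, l in zip(semantic_batch, language_batch)]
    return sum(sims) / len(sims)


# Toy usage: the first pair is orthogonal, the second is parallel.
semantic = [[1.0, 0.0], [1.0, 0.0]]
language = [[0.0, 1.0], [2.0, 0.0]]
penalty = orthogonality_penalty(semantic, language)
```

In a training setup, a term like this would typically be added to the main alignment loss with a weighting coefficient; how ORACLE itself combines its terms should be checked against the paper and the official repository.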

Findings

01 ORACLE reduces semantic leakage in cross-lingual embeddings.

02 Improved semantic alignment enhances retrieval and similarity tasks.

03 Method outperforms existing disentanglement approaches.

Abstract

Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage - a term we introduce to describe when a substantial amount of language-specific information is unintentionally leaked into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctively represent the meaning of the sentence. To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and…

Code & Models

Repositories

dayeonki/oracle (PyTorch, official)

Taxonomy

Topics: Text Readability and Simplification · Access Control and Trust · Topic Modeling