Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint
Dayeon Ki, Cheonbok Park, Hyunjoong Kim

TL;DR
This paper introduces ORACLE, a novel training method that enforces orthogonality between semantic and language embeddings to reduce semantic leakage and improve cross-lingual sentence representation alignment.
Contribution
The paper proposes a new orthogonality constraint-based training objective, ORACLE, to effectively disentangle semantics and language in cross-lingual embeddings.
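As a rough illustration of what an orthogonality constraint between semantic and language embeddings can look like, the sketch below penalises the squared cosine similarity between each paired semantic and language vector. It is a minimal sketch, not the ORACLE objective from the paper: the function name `orthogonality_penalty`, the squared-cosine form, and the toy vectors are all assumptions made for illustration.

```python
import numpy as np

def orthogonality_penalty(semantic, language, eps=1e-8):
    """Mean squared cosine similarity between paired rows.

    Hypothetical helper (not from the paper): the value is 0 when every
    semantic vector is orthogonal to its language vector and grows toward
    1 as the two become parallel.
    """
    semantic = np.asarray(semantic, dtype=float)
    language = np.asarray(language, dtype=float)
    dots = np.sum(semantic * language, axis=1)
    norms = np.linalg.norm(semantic, axis=1) * np.linalg.norm(language, axis=1)
    cos = dots / (norms + eps)  # eps guards against zero-length vectors
    return float(np.mean(cos ** 2))

# Toy data: two sentences, 2-dimensional embeddings (illustrative only).
sem = np.array([[1.0, 0.0], [0.0, 2.0]])
lang_orth = np.array([[0.0, 3.0], [1.0, 0.0]])   # orthogonal to sem row by row
lang_leaky = np.array([[1.0, 1.0], [0.0, 1.0]])  # overlaps with sem

penalty_orth = orthogonality_penalty(sem, lang_orth)
penalty_leaky = orthogonality_penalty(sem, lang_leaky)
```

In a training setup, a term of this kind would typically be added to the main loss so that gradient updates push the two embedding types apart; the weighting and the exact formulation used by ORACLE are not stated in this summary.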
Findings
ORACLE reduces semantic leakage in cross-lingual embeddings.
Improved semantic alignment improves performance on retrieval and similarity tasks.
The method outperforms existing disentanglement approaches.
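For context on the retrieval finding, cross-lingual retrieval over sentence embeddings is commonly done by nearest-neighbour search under cosine similarity. The sketch below shows that general technique on toy vectors; `retrieve_nearest` and the data are hypothetical, and this is not the paper's evaluation protocol.

```python
import numpy as np

def retrieve_nearest(queries, candidates):
    """Return, for each query row, the index of the most cosine-similar candidate."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sims = q @ c.T  # pairwise cosine similarities, shape (n_queries, n_candidates)
    return np.argmax(sims, axis=1)

# Toy source- and target-language sentence embeddings (illustrative only).
src = np.array([[1.0, 0.1], [0.1, 1.0]])
tgt = np.array([[0.2, 0.9], [0.9, 0.2]])

matches = retrieve_nearest(src, tgt)
```

Here each source sentence is matched to the target sentence whose embedding points in the closest direction, which is why language-specific information left in the semantic embedding can degrade matching across languages.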
Abstract
Accurately aligning contextual representations in cross-lingual sentence embeddings is key for effective parallel data mining. A common strategy for achieving this alignment involves disentangling semantics and language in sentence embeddings derived from multilingual pre-trained models. However, we discover that current disentangled representation learning methods suffer from semantic leakage, a term we introduce to describe cases where a substantial amount of language-specific information is unintentionally leaked into semantic representations. This hinders the effective disentanglement of semantic and language representations, making it difficult to retrieve embeddings that distinctively represent the meaning of the sentence. To address this challenge, we propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE), tailored to enforce orthogonality between semantic and…
Taxonomy
Topics: Text Readability and Simplification · Access Control and Trust · Topic Modeling
