LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal

Dongjun Kim; Jeongho Yoon; Chanjun Park; Heuiseok Lim

arXiv:2601.04768·cs.CL·January 9, 2026

Open Access

TL;DR

LANGSAE EDITING is a post-hoc method that removes language identity signals from multilingual embeddings, improving cross-language retrieval without retraining encoders.

Contribution

The paper introduces a controllable, sparse-autoencoder-based method that suppresses language-identity signal in multilingual embeddings after encoding, improving retrieval over mixed-language collections.

Findings

01. Consistent improvements in ranking quality across multiple languages.

02. Significant gains for script-distinct languages.

03. Compatible with existing vector databases without retraining.

Abstract

Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with…
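The pipeline the abstract describes (encode pooled embeddings into sparse latents, pick out language-associated units from cross-language activation statistics, suppress them, decode back to the original dimensionality) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the ReLU encoder, the random untrained weights, the `strength` parameter, and the scoring rule (per-unit gap between the highest per-language mean activation and the average across languages) are all hypothetical stand-ins for details the abstract does not give.

```python
import numpy as np

def encode(X, W_enc, b_enc):
    # ReLU encoder producing nonnegative latent activations (assumed architecture).
    return np.maximum(0.0, X @ W_enc + b_enc)

def decode(Z, W_dec, b_dec):
    # Linear decoder mapping latents back to the original embedding dimension.
    return Z @ W_dec + b_dec

def language_units(Z, langs, top_k):
    # Hypothetical scoring rule: a unit is "language-associated" when its mean
    # activation for one language is far above its average across languages.
    langs = np.asarray(langs)
    means = np.stack([Z[langs == lang].mean(axis=0) for lang in np.unique(langs)])
    score = means.max(axis=0) - means.mean(axis=0)
    return np.argsort(score)[::-1][:top_k]

def edit_embeddings(X, params, units, strength=1.0):
    # Scale the selected units by (1 - strength), then reconstruct.
    # strength=1.0 removes them entirely; strength=0.0 is plain reconstruction.
    W_enc, b_enc, W_dec, b_dec = params
    Z = encode(X, W_enc, b_enc)
    Z[:, units] *= (1.0 - strength)
    return decode(Z, W_dec, b_dec)

# Toy data: random weights stand in for a trained sparse autoencoder.
rng = np.random.default_rng(0)
d, m = 8, 16  # embedding dimension, latent dimension (arbitrary toy sizes)
params = (rng.normal(size=(d, m)), np.zeros(m),
          rng.normal(size=(m, d)), np.zeros(d))
X = rng.normal(size=(6, d))  # stand-in for pooled embeddings
langs = ["en", "en", "en", "ko", "ko", "ko"]

units = language_units(encode(X, params[0], params[1]), langs, top_k=4)
X_edit = edit_embeddings(X, params, units)
```

Because `X_edit` has the same shape as `X`, the edited vectors could in principle replace the stored ones in an existing index, which is the compatibility property the abstract emphasizes.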

Taxonomy

Topics: Data Quality and Management · Topic Modeling · Information Retrieval and Search Behavior