LANGSAE EDITING: Improving Multilingual Information Retrieval via Post-hoc Language Identity Removal
Dongjun Kim, Jeongho Yoon, Chanjun Park, Heuiseok Lim

TL;DR
LANGSAE EDITING is a post-hoc method that removes language identity signals from multilingual embeddings, improving cross-language retrieval without retraining encoders.
Contribution
The paper introduces a controllable, sparse-autoencoder-based approach that removes language-identity signal from pooled embeddings to improve multilingual retrieval.
Findings
Consistent improvements in ranking quality across multiple languages.
Significant gains for script-distinct languages.
Compatible with existing vector databases, with no retraining of the base encoder and no re-encoding of raw text.
Abstract
Dense retrieval in multilingual settings often searches over mixed-language collections, yet multilingual embeddings encode language identity alongside semantics. This language signal can inflate similarity for same-language pairs and crowd out relevant evidence written in other languages. We propose LANGSAE EDITING, a post-hoc sparse autoencoder trained on pooled embeddings that enables controllable removal of language-identity signal directly in vector space. The method identifies language-associated latent units using cross-language activation statistics, suppresses these units at inference time, and reconstructs embeddings in the original dimensionality, making it compatible with existing vector databases without retraining the base encoder or re-encoding raw text. Experiments across multiple languages show consistent improvements in ranking quality and cross-language coverage, with…
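The pipeline the abstract describes (encode into sparse latents, find language-associated units from cross-language activation statistics, suppress them, reconstruct at the original dimensionality) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `encode`, `decode`, `language_units`, and `edit` are hypothetical, the scoring rule (variance of per-language mean activation) is an assumed stand-in for whatever statistic the paper uses, and the identity-weight "autoencoder" and four toy vectors replace a trained sparse autoencoder and real embeddings.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    # ReLU latent code: z = max(0, x @ W_enc + b_enc)
    return np.maximum(0.0, x @ W_enc + b_enc)

def decode(z, W_dec, b_dec):
    # Linear reconstruction back to the original embedding dimensionality
    return z @ W_dec + b_dec

def language_units(z, langs, top_k):
    # Assumed scoring rule: a unit whose mean activation differs a lot
    # between languages is treated as language-associated.
    labels = np.unique(langs)
    means = np.stack([z[langs == lab].mean(axis=0) for lab in labels])
    score = means.var(axis=0)
    return np.argsort(score)[::-1][:top_k]

def edit(x, sae, units):
    # Suppress the selected latent units, then reconstruct.
    z = encode(x, sae["W_enc"], sae["b_enc"])
    z[:, units] = 0.0
    return decode(z, sae["W_dec"], sae["b_dec"])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy data (hypothetical): dims 0-1 carry "semantics", dim 2 marks the language.
x = np.array([[1.0, 0.0, 1.0],   # en, topic A
              [0.0, 1.0, 1.0],   # en, topic B
              [1.0, 0.0, 0.0],   # ko, topic A
              [0.0, 1.0, 0.0]])  # ko, topic B
langs = np.array(["en", "en", "ko", "ko"])

# Identity weights stand in for a trained sparse autoencoder.
d = x.shape[1]
sae = {"W_enc": np.eye(d), "b_enc": np.zeros(d),
       "W_dec": np.eye(d), "b_dec": np.zeros(d)}

units = language_units(encode(x, sae["W_enc"], sae["b_enc"]), langs, top_k=1)
edited = edit(x, sae, units)

same_topic_before = cosine(x[0], x[2])            # en/ko, same topic
same_topic_after = cosine(edited[0], edited[2])
same_lang_before = cosine(x[0], x[1])             # en/en, different topics
same_lang_after = cosine(edited[0], edited[1])
```

In this toy setup the same-topic English and Korean vectors become identical after editing, while the similarity of the two same-language, different-topic vectors drops to zero, which mirrors the inflation effect the abstract attributes to language identity. Because the edited vectors keep the original dimensionality, they could in principle overwrite stored vectors in an existing index.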
Taxonomy
Topics: Data Quality and Management · Topic Modeling · Information Retrieval and Search Behavior
