Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias
Elias Schuhmacher, Andrianos Michail, Juri Opitz, Rico Sennrich, Simon Clematide

TL;DR
This paper shows that long-document embeddings exhibit positional and language biases, and proposes an attention calibration method that improves segment discoverability. The evaluation framework is available online.
Contribution
The paper introduces a permutation-based evaluation framework for quantifying representation biases and an inference-time attention calibration method to mitigate positional and language biases in embeddings.
Findings
State-of-the-art embedding models show positional and language biases on long, multi-segment documents.
Early segments and high-resource language segments are over-represented.
Attention calibration improves discoverability of later segments.
Abstract
To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly…
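The idea of a permutation-based evaluation can be illustrated with a minimal sketch. This is not the authors' framework: the function names, the cosine-similarity score, and the toy embedder (a weighted mean over token vectors whose weights decay with position, standing in for front-loaded pooling) are all assumptions made for illustration. The sketch embeds every ordering of a document's segments and averages, per position, how similar the document embedding is to the standalone embedding of the segment placed there.

```python
import itertools
import numpy as np


def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def make_toy_embedder(decay):
    """Hypothetical embedder: weighted mean of token vectors.

    decay < 1 puts more weight on early tokens (front-loaded pooling);
    decay == 1 is plain mean pooling.
    """
    def embed(segments):
        tokens = np.concatenate(segments, axis=0)      # (num_tokens, dim)
        weights = decay ** np.arange(len(tokens))
        weights = weights / weights.sum()
        return weights @ tokens                        # (dim,)
    return embed


def positional_similarity(segments, embed):
    """Average document-to-segment similarity per position over all orderings."""
    n = len(segments)
    standalone = [embed([seg]) for seg in segments]
    totals = np.zeros(n)
    num_perms = 0
    for perm in itertools.permutations(range(n)):
        doc_emb = embed([segments[i] for i in perm])
        for pos, seg_idx in enumerate(perm):
            totals[pos] += cosine(doc_emb, standalone[seg_idx])
        num_perms += 1
    return totals / num_perms


# Toy data (assumption): 3 segments of 8 random "token vectors" in 64 dimensions.
rng = np.random.default_rng(0)
segments = list(rng.normal(size=(3, 8, 64)))

biased = positional_similarity(segments, make_toy_embedder(decay=0.8))
uniform = positional_similarity(segments, make_toy_embedder(decay=1.0))
print("front-loaded pooling:", np.round(biased, 3))
print("uniform pooling:     ", np.round(uniform, 3))
```

Because every segment occupies every position equally often, differences between the per-position averages reflect position rather than segment content. In this toy setup the front-loaded embedder scores the first position higher than the last, while mean pooling gives the same score at every position.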
