Information Representation Fairness in Long-Document Embeddings: The Peculiar Interaction of Positional and Language Bias

Elias Schuhmacher; Andrianos Michail; Juri Opitz; Rico Sennrich; Simon Clematide

arXiv:2601.16934·cs.CL·April 21, 2026

TL;DR

This paper shows that long-document embeddings exhibit positional and language biases, and proposes an attention calibration method that makes under-represented segments more discoverable. The accompanying evaluation framework is available online.

Contribution

The paper introduces a permutation-based evaluation framework for quantifying how well each segment of a document is reflected in its embedding, and an attention calibration method to mitigate positional and language biases in embeddings.
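The idea behind a permutation-based evaluation can be sketched in a few lines. The following is a minimal illustration and not the authors' framework: it assumes that "reflection" is measured as the cosine similarity between the document embedding and each segment's standalone embedding, averaged per position over all segment orderings. The embedder `make_toy_embedder` is a hypothetical stand-in for a real model (a bag of words whose token weights decay with position), and `similarity_by_position` is an illustrative name.

```python
import itertools
import math


def make_toy_embedder(decay):
    """Hypothetical stand-in for a real model: a sparse bag-of-words vector
    in which the token at index i gets weight decay**i. decay < 1 imitates a
    front-loaded model; decay == 1 treats every position equally."""
    def embed(text):
        vec = {}
        for i, tok in enumerate(text.split()):
            vec[tok] = vec.get(tok, 0.0) + decay ** i
        return vec
    return embed


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b)


def similarity_by_position(segments, embed):
    """Average, over all segment orderings, the cosine similarity between
    the document embedding and the standalone embedding of the segment
    placed at each position."""
    seg_vecs = [embed(s) for s in segments]
    totals = [0.0] * len(segments)
    n_perms = 0
    for perm in itertools.permutations(range(len(segments))):
        doc_vec = embed(" ".join(segments[i] for i in perm))
        for pos, seg_idx in enumerate(perm):
            totals[pos] += cosine(doc_vec, seg_vecs[seg_idx])
        n_perms += 1
    return [t / n_perms for t in totals]


segments = [
    "alpha beta gamma delta",
    "red green blue yellow",
    "one two three four",
]
biased = similarity_by_position(segments, make_toy_embedder(decay=0.8))
unbiased = similarity_by_position(segments, make_toy_embedder(decay=1.0))
print("front-loaded:", [round(x, 3) for x in biased])
print("uniform:     ", [round(x, 3) for x in unbiased])
```

Because every segment visits every position equally often, differences between the per-position averages reflect position rather than content. Enumerating all orderings is only feasible for a handful of segments; with more segments, sampling permutations would be the natural substitute.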

Findings

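01 State-of-the-art embedding models show systematic positional and language biases on long, multi-segment documents.

02 Early segments and segments in higher-resource languages such as English are over-represented; later segments and segments in lower-resource languages are marginalized.

03 Attention calibration improves the discoverability of later segments.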

Abstract

To be discoverable in an embedding-based search process, each part of a document should be reflected in its embedding representation. To quantify any potential reflection biases, we introduce a permutation-based evaluation framework. With this, we observe that state-of-the-art embedding models exhibit systematic positional and language biases when documents are longer and consist of multiple segments. Specifically, early segments and segments in higher-resource languages like English are over-represented, while later segments and segments in lower-resource languages are marginalized. In our further analysis, we find that the positional bias stems from front-loaded attention distributions in pooling-token embeddings, where early tokens receive more attention. To mitigate this issue, we introduce an inference-time attention calibration method that redistributes attention more evenly…
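The abstract is cut off before it describes how the calibration works, so the snippet below is not the authors' method. It is only a generic sketch of what "redistributing attention more evenly" could look like, under the assumption that the pooling token's attention over input tokens is available as a list of non-negative weights. The function `calibrate_attention` and its `strength` parameter are hypothetical: the weights are normalized and then linearly interpolated toward a uniform distribution.

```python
def calibrate_attention(weights, strength=0.5):
    """Blend a normalized attention distribution with the uniform one.

    strength = 0 leaves the distribution unchanged; strength = 1 makes it
    fully uniform. This interpolation rule is an illustrative assumption.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must lie in [0, 1]")
    total = sum(weights)
    uniform = 1.0 / len(weights)
    return [(1.0 - strength) * (w / total) + strength * uniform
            for w in weights]


# A front-loaded toy distribution: early tokens receive most of the mass.
attention = [0.40, 0.30, 0.15, 0.10, 0.05]
calibrated = calibrate_attention(attention, strength=0.5)

print("original:  ", attention)
print("calibrated:", [round(w, 3) for w in calibrated])
```

The result is still a probability distribution, and the ordering of the weights is preserved; only the gap between early and late tokens shrinks.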

Code & Models

Repositories

impresso/fair-sentence-transformers
github

Datasets

impresso-project/wiki_comparable_corpus_en_de_hi_it_ko_zh
dataset · 348 dl