Positional Bias in Multimodal Embedding Models: Do They Favor the Beginning, the Middle, or the End?
Kebin Wu, Fatima Albreiki

TL;DR
This paper investigates positional bias in multimodal image-text retrieval models, revealing modality-specific biases that depend on encoding schemes and training methods and that affect model performance.
Contribution
It is the first comprehensive study to separate positional bias from context importance and to analyze the bias in multimodal models across modalities and datasets.
Findings
Text encoders favor the beginning of inputs.
Image encoders show bias toward both the beginning and the end of inputs.
Bias is influenced by encoding schemes and training methods.
Abstract
Positional bias, where models overemphasize certain positions regardless of content, has been shown to negatively impact model performance across various tasks. While recent research has extensively examined positional bias in text generation models, its presence and effects in representation models remain underexplored. Even less is known about such biases in multimodal models. In this work, we investigate positional bias in multimodal representation models, specifically in the context of image-text retrieval. We begin by distinguishing between context importance and positional bias, and then assess the presence and extent of positional bias across different models and datasets. Our experiments demonstrate that positional bias is prevalent in multimodal models, but manifests differently across modalities: text encoders tend to exhibit bias toward the beginning of the input, whereas…
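The abstract's distinction between context importance and positional bias can be illustrated with a small probe. The sketch below is not the authors' protocol; it is one possible reading under stated assumptions. It rotates the segments of an input so that every segment visits every position, removes one position at a time, and averages the drop in cosine similarity per position, so that content effects average out and only position effects remain. The names `toy_encode`, `positional_profile`, and the `decay` parameter are hypothetical, and the toy encoder has its bias built in by construction rather than learned.

```python
import math
from collections import Counter


def toy_encode(tokens, decay):
    """Hypothetical stand-in for a text encoder (NOT a real model).

    Returns a bag-of-words vector in which the token at index i gets weight
    decay**i, so decay < 1 builds in a bias toward the beginning by design.
    """
    vec = Counter()
    for i, tok in enumerate(tokens):
        vec[tok] += decay ** i
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def positional_profile(segments, encode):
    """Average similarity drop when the segment at each position is removed.

    Cyclic rotations place every segment at every position once, so the
    per-position average reflects position rather than segment content.
    """
    n = len(segments)
    drops = [0.0] * n
    for shift in range(n):
        order = segments[shift:] + segments[:shift]
        full = encode([t for seg in order for t in seg])
        for pos in range(n):
            kept = order[:pos] + order[pos + 1:]
            ablated = encode([t for seg in kept for t in seg])
            drops[pos] += 1.0 - cosine(full, ablated)
    return [d / n for d in drops]


# Assumed toy input: three segments with disjoint tokens.
segments = [["a", "b"], ["c", "d"], ["e", "f"]]
biased = positional_profile(segments, lambda toks: toy_encode(toks, decay=0.7))
uniform = positional_profile(segments, lambda toks: toy_encode(toks, decay=1.0))
print("biased :", [round(x, 3) for x in biased])
print("uniform:", [round(x, 3) for x in uniform])
```

With `decay=1.0` the toy encoder treats all positions alike and the profile is flat; with `decay=0.7` the profile is largest at the first position, which mimics a beginning-of-input bias. Probing a real encoder would mean swapping `toy_encode` for a function that returns that model's embedding and adapting `cosine` to dense vectors.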
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Generative Adversarial Networks and Image Synthesis
