Controlling for Stereotypes in Multimodal Language Model Evaluation
Manuj Malik, Richard Johansson

TL;DR
This paper introduces benchmarks to evaluate how multimodal language models rely on visual signals versus stereotypes, revealing differences in sensitivity among models like FLAVA, VisualBERT, and LXMERT.
Contribution
It presents a novel methodology and benchmark datasets for measuring stereotype reliance in multimodal models, highlighting model differences in sensitivity to visual cues.
Findings
FLAVA is less affected by stereotypes than older models such as VisualBERT and LXMERT
Models vary significantly in how much they rely on the visual signal
Controlled benchmarks reveal these differences more clearly than traditional evaluations
Abstract
We propose a methodology and design two benchmark sets for measuring to what extent vision-and-language models use the visual signal in the presence or absence of stereotypes. The first benchmark is designed to test for stereotypical colors of common objects, while the second benchmark considers gender stereotypes. The key idea is to compare predictions when the image conforms to the stereotype to predictions when it does not. Our results show that there is significant variation among multimodal models: the recent Transformer-based FLAVA seems to be more sensitive to the choice of image and less affected by stereotypes than older models such as VisualBERT and LXMERT, which rely on CNN-based visual features. This effect is more discernible in this type of controlled setting than in traditional evaluations where we do not know whether the model relied on the stereotype or the visual signal.
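The comparison described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation or their exact metric: the `Example` schema, the function names, and the "fallback rate" are assumptions introduced here for illustration only. The idea is to split items by whether the image agrees with the stereotype, then compare how often the model's prediction matches the image in each group.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One benchmark item (hypothetical schema, not taken from the paper)."""
    gold: str        # what the image actually shows, e.g. "green"
    stereotype: str  # the stereotypical answer for the object, e.g. "yellow"
    prediction: str  # the model's predicted answer


def accuracy(examples):
    """Fraction of examples where the prediction matches the image."""
    if not examples:
        return 0.0
    return sum(e.prediction == e.gold for e in examples) / len(examples)


def stereotype_report(examples):
    """Compare performance on stereotype-conforming vs. conflicting images."""
    conforming = [e for e in examples if e.gold == e.stereotype]
    conflicting = [e for e in examples if e.gold != e.stereotype]
    acc_conforming = accuracy(conforming)
    acc_conflicting = accuracy(conflicting)
    # Share of conflicting items where the model answered with the stereotype
    # instead of what the image shows (an assumed, illustrative measure).
    fallback = (
        sum(e.prediction == e.stereotype for e in conflicting) / len(conflicting)
        if conflicting
        else 0.0
    )
    return {
        "acc_conforming": acc_conforming,
        "acc_conflicting": acc_conflicting,
        "gap": acc_conforming - acc_conflicting,
        "stereotype_fallback_rate": fallback,
    }


# Toy data: a banana shown as yellow (conforming) or green (conflicting).
examples = [
    Example(gold="yellow", stereotype="yellow", prediction="yellow"),
    Example(gold="yellow", stereotype="yellow", prediction="yellow"),
    Example(gold="green", stereotype="yellow", prediction="yellow"),
    Example(gold="green", stereotype="yellow", prediction="green"),
]
report = stereotype_report(examples)
```

Under these assumptions, a large gap together with a high fallback rate would suggest that a model leans on the stereotype rather than the image, while a small gap would suggest it is using the visual signal.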
Taxonomy
Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling
Methods: Test · VisualBERT · Learning Cross-Modality Encoder Representations from Transformers · FLAVA
