Controlling for Stereotypes in Multimodal Language Model Evaluation

Manuj Malik; Richard Johansson

arXiv:2302.01582 · cs.CL · February 6, 2023

TL;DR

This paper introduces two benchmarks that measure how much multimodal language models rely on the visual signal rather than on stereotypes, and finds that FLAVA, VisualBERT, and LXMERT differ in their sensitivity to the image.

Contribution

The paper presents a methodology and two benchmark datasets for measuring stereotype reliance in multimodal models, and shows that models differ in how sensitive they are to visual cues.

Findings

1. FLAVA is less affected by stereotypes than older models
2. Models vary significantly in their reliance on visual signals
3. Controlled benchmarks reveal model sensitivities more clearly

Abstract

We propose a methodology and design two benchmark sets for measuring to what extent language-and-vision language models use the visual signal in the presence or absence of stereotypes. The first benchmark is designed to test for stereotypical colors of common objects, while the second benchmark considers gender stereotypes. The key idea is to compare predictions when the image conforms to the stereotype to predictions when it does not. Our results show that there is significant variation among multimodal models: the recent Transformer-based FLAVA seems to be more sensitive to the choice of image and less affected by stereotypes than older CNN-based models such as VisualBERT and LXMERT. This effect is more discernible in this type of controlled setting than in traditional evaluations where we do not know whether the model relied on the stereotype or the visual signal.
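
The comparison described in the abstract (predictions on images that conform to the stereotype versus images that do not) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Example` record, the `stereotype_report` function, the three reported rates, and the toy data are hypothetical and are not taken from the paper, whose exact metrics and data format are not given in this summary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One benchmark item (hypothetical format, not the paper's)."""
    predicted: str   # model's answer, e.g. a color word it filled in
    shown: str       # attribute actually visible in the image
    stereotype: str  # attribute stereotypically associated with the object


def stereotype_report(examples):
    """Compare predictions on stereotype-conforming vs. conflicting images."""
    conforming = [e for e in examples if e.shown == e.stereotype]
    conflicting = [e for e in examples if e.shown != e.stereotype]

    def rate(items, hit):
        # Fraction of items satisfying `hit`; NaN when the group is empty.
        if not items:
            return float("nan")
        return sum(1 for e in items if hit(e)) / len(items)

    return {
        # Accuracy when the image agrees with the stereotype.
        "acc_conforming": rate(conforming, lambda e: e.predicted == e.shown),
        # Accuracy when the image contradicts the stereotype.
        "acc_conflicting": rate(conflicting, lambda e: e.predicted == e.shown),
        # How often the model answers with the stereotype despite the image.
        "stereotype_fallback": rate(
            conflicting, lambda e: e.predicted == e.stereotype
        ),
    }


# Toy data: bananas are stereotypically yellow; two images show a green one.
examples = [
    Example(predicted="yellow", shown="yellow", stereotype="yellow"),
    Example(predicted="yellow", shown="yellow", stereotype="yellow"),
    Example(predicted="green", shown="green", stereotype="yellow"),
    Example(predicted="yellow", shown="green", stereotype="yellow"),
]
report = stereotype_report(examples)
print(report)
```

Under these assumptions, a large gap between `acc_conforming` and `acc_conflicting`, together with a high `stereotype_fallback`, would suggest that a model leans on the stereotype rather than the image. A conventional evaluation that mixes both kinds of image into one accuracy figure would hide that gap.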

Taxonomy

Topics: Natural Language Processing Techniques · Multimodal Machine Learning Applications · Topic Modeling

Methods: Test · VisualBERT · Learning Cross-Modality Encoder Representations from Transformers · FLAVA