Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Michal Golovanevsky; William Rudman; Michael Lepori; Amir Bar; Ritambhara Singh; Carsten Eickhoff

arXiv:2505.17127·cs.CV·September 30, 2025

TL;DR

This paper investigates whether multimodal large language models rely more on memorized world knowledge or visual input, introducing a dataset and a control mechanism to steer model predictions between priors and visual evidence.

Contribution

It introduces Visual CounterFact, a dataset of counterfactual images, and Pixels Versus Priors (PvP), a method to control model reliance on priors versus visual input.

Findings

01. Models initially rely on priors but shift to visual evidence in mid-to-late layers.

02. PvP effectively redirects model predictions from priors to visual input.

03. PvP achieves a high success rate in steering model outputs toward counterfactuals.

Abstract

Multimodal Large Language Models (MLLMs) perform well on tasks such as visual question answering, but it remains unclear whether their reasoning relies more on memorized world knowledge or on the visual information present in the input image. To investigate this, we introduce Visual CounterFact, a new dataset of visually realistic counterfactuals that put world knowledge priors (e.g., red strawberry) into direct conflict with visual input (e.g., blue strawberry). Using Visual CounterFact, we show that model predictions initially reflect memorized priors, but shift toward visual evidence in mid-to-late layers. This dynamic reveals a competition between the two modalities, with visual input ultimately overriding priors during evaluation. To control this behavior, we propose Pixels Versus Priors (PvP) steering vectors, a mechanism for controlling model outputs toward either world knowledge…
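As a rough illustration of activation steering in general (not the authors' PvP implementation, whose details are not given here), a steering vector is often taken as the difference between mean hidden activations collected under two conditions and then added to a hidden state at inference time. The sketch below assumes that difference-of-means recipe; the function names, the toy two-dimensional activations, and the `alpha` scaling parameter are all hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty collection of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def steering_vector(target_acts: Sequence[Sequence[float]],
                    source_acts: Sequence[Sequence[float]]) -> Vector:
    """Difference of mean activations, pointing from 'source' toward 'target'.

    Assumption: this is the generic difference-of-means recipe, used here
    only for illustration; it is not taken from the paper.
    """
    target_mean = mean_vector(target_acts)
    source_mean = mean_vector(source_acts)
    return [t - s for t, s in zip(target_mean, source_mean)]


def apply_steering(hidden: Sequence[float],
                   direction: Sequence[float],
                   alpha: float = 1.0) -> Vector:
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (hypothetical): one set recorded when the model answers from
# its prior ("red strawberry"), one when it answers from the image ("blue").
prior_acts = [[1.0, 0.0], [3.0, 0.0]]
counterfact_acts = [[0.0, 2.0], [0.0, 4.0]]

# Direction that moves a hidden state from the prior toward the visual input.
to_pixels = steering_vector(counterfact_acts, prior_acts)
steered = apply_steering([2.0, 0.0], to_pixels, alpha=1.0)
```

Swapping the two arguments of `steering_vector` gives the opposite direction, which in this toy setup would push a hidden state back toward the prior.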

Code & Models

Datasets

mgolov/Visual-Counterfact
dataset · 450 downloads

Videos

Pixels Versus Priors: Controlling Knowledge Priors in Vision-Language Models through Visual Counterfacts

Taxonomy

Topics: Multimodal Machine Learning Applications