Gaslight, Gatekeep, V1-V3: Early Visual Cortex Alignment Shields Vision-Language Models from Sycophantic Manipulation

Arya Shah; Vaibhav Tripathi; Mayank Singh; Chaklam Silpasuwanchai

arXiv:2604.13803 · cs.CV · April 16, 2026

TL;DR

This study finds that vision-language models with visual representations aligned to early human visual cortex are more resistant to manipulative prompts, highlighting the importance of brain-like visual encoding for AI safety.

Contribution

It demonstrates a specific link between early visual cortex alignment and reduced susceptibility to sycophantic manipulation in vision-language models.

Findings

01 Early visual cortex alignment negatively correlates with model sycophancy.

02 All tested models showed this negative correlation in ROI analysis.

03 Faithful low-level visual encoding acts as an anchor against linguistic manipulation.
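
The first finding is a cross-model correlation: each model contributes one alignment score and one sycophancy score, and the two sets of scores are compared. The listing does not say which correlation statistic the authors use, so the sketch below assumes a Spearman rank correlation. The function names and all numeric values are hypothetical placeholders, not results from the paper.

```python
import numpy as np

def rankdata(x):
    """Return 1-based ranks; tied values share their mean rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks

def spearman_rho(a, b):
    """Spearman correlation = Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata(a), rankdata(b))[0, 1])

# Placeholder per-model scores (made up for illustration only).
alignment = [0.10, 0.25, 0.18, 0.30, 0.22]   # e.g. early-visual-cortex alignment
sycophancy = [0.80, 0.35, 0.60, 0.20, 0.50]  # e.g. fraction of answers flipped

rho = spearman_rho(alignment, sycophancy)
print(f"Spearman rho across models: {rho:.2f}")
```

A negative `rho` means that models ranked higher on alignment tend to rank lower on sycophancy, which is the direction the finding describes.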

Abstract

Vision-language models are increasingly deployed in high-stakes settings, yet their susceptibility to sycophantic manipulation remains poorly understood, particularly in relation to how these models represent visual information internally. Whether models whose visual representations more closely mirror human neural processing are also more resistant to adversarial pressure is an open question with implications for both neuroscience and AI safety. We investigate this question by evaluating 12 open-weight vision-language models spanning 6 architecture families and a 40× parameter range (256M to 10B) along two axes: brain alignment, measured by predicting fMRI responses from the Natural Scenes Dataset across 8 human subjects and 6 visual cortex regions of interest, and sycophancy, measured through 76,800 two-turn gaslighting prompts spanning 5 categories and 10 difficulty levels.…
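
The abstract names two measurements without giving their exact form in this excerpt. As a rough sketch only, the code below assumes that brain alignment is scored with a linear (ridge) encoding model evaluated by held-out Pearson correlation, and that sycophancy is scored as the fraction of initially correct answers that flip after the second-turn pushback. Both definitions, all function names, and the synthetic data are assumptions for illustration; the paper's actual pipeline may differ.

```python
import numpy as np

def ridge_fit(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha I)^-1 X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ Y)

def columnwise_pearson(A, B):
    """Pearson r between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom

def alignment_score(feats, voxels, n_train, alpha=1.0):
    """Fit on the first n_train images, report mean held-out r over voxels."""
    X_tr, X_te = feats[:n_train], feats[n_train:]
    Y_tr, Y_te = voxels[:n_train], voxels[n_train:]
    mu_x, mu_y = X_tr.mean(axis=0), Y_tr.mean(axis=0)
    W = ridge_fit(X_tr - mu_x, Y_tr - mu_y, alpha)
    pred = (X_te - mu_x) @ W + mu_y
    return float(columnwise_pearson(pred, Y_te).mean())

def flip_rate(first_correct, second_correct):
    """Share of initially correct answers that become wrong after pushback."""
    first = np.asarray(first_correct, dtype=bool)
    second = np.asarray(second_correct, dtype=bool)
    if first.sum() == 0:
        return float("nan")
    return float((first & ~second).sum() / first.sum())

# Synthetic stand-ins: 200 "images", 16 model features, 5 "voxels" in one ROI.
rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 16))
voxels = feats @ rng.normal(size=(16, 5)) + 0.1 * rng.normal(size=(200, 5))

score = alignment_score(feats, voxels, n_train=150)
rate = flip_rate([True, True, False, True], [True, False, False, False])
print(f"alignment score: {score:.3f}, flip rate: {rate:.3f}")
```

Under these assumptions, repeating `alignment_score` per region of interest and per subject would yield one alignment value per model and ROI, which can then be compared against that model's flip rate.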

Code & Models

Repositories

aryashah2k/Gaslight-Gatekeep-Sycophantic-Manipulation
github

Datasets

aryashah00/Gaslight-Gatekeep-V1-V3
dataset · 105 dl