When Language Models Lose Their Mind: The Consequences of Brain Misalignment

Gabriele Merlin; Mariya Toneva

arXiv:2603.23091·cs.CL·March 25, 2026

Open Access · 3 Reviews

TL;DR

This paper investigates the importance of brain alignment in large language models by comparing well-aligned and misaligned models across numerous linguistic tasks, revealing that misalignment significantly hampers language understanding.

Contribution

It introduces brain-misaligned models to study the impact of brain alignment on linguistic performance, providing new insight into the relationship between brain alignment and linguistic competence in LLMs.

Findings

01. Brain misalignment impairs downstream linguistic performance

02. Brain alignment is crucial for robust language understanding

03. Misaligned models perform worse across diverse linguistic tasks

Abstract

While brain-aligned large language models (LLMs) have garnered attention for their potential as cognitive models and for enhancing safety and trustworthiness in AI, the role of this brain alignment in linguistic competence remains uncertain. In this work, we investigate the functional implications of brain alignment by introducing brain-misaligned models--LLMs intentionally trained to predict brain activity poorly while maintaining high language modeling performance. We evaluate these models on over 200 downstream tasks encompassing diverse linguistic domains, including semantics, syntax, discourse, reasoning, and morphology. By comparing brain-misaligned models with well-matched brain-aligned counterparts, we isolate the specific impact of brain alignment on language understanding. Our experiments reveal that brain misalignment substantially impairs downstream…
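The abstract describes models trained to predict brain activity poorly while keeping language modeling performance high. The excerpt does not give the training loss, so the following is only a minimal sketch of one way such a trade-off could be written down, assuming brain predictivity is scored as the mean per-voxel Pearson correlation between predicted and recorded responses. The function names `pearson_per_voxel` and `misalignment_objective` and the `weight` parameter are hypothetical and are not taken from the paper.

```python
import numpy as np

def pearson_per_voxel(pred, actual):
    """Pearson correlation between predicted and recorded responses,
    computed separately for each voxel (column). Inputs: (time, voxels)."""
    pred_c = pred - pred.mean(axis=0)
    actual_c = actual - actual.mean(axis=0)
    num = (pred_c * actual_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (actual_c ** 2).sum(axis=0))
    return num / den

def misalignment_objective(lm_loss, pred, actual, weight=1.0):
    """Hypothetical objective (an assumption, not the authors' loss):
    minimizing it lowers the language modeling loss while also lowering
    the mean brain predictivity."""
    return lm_loss + weight * pearson_per_voxel(pred, actual).mean()
```

Under this assumed form, a larger `weight` pushes harder against brain predictivity at the cost of language modeling loss; flipping its sign would instead reward alignment.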

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 4 · Confidence 1

Strengths

1. The main contribution is proposing a new paradigm beyond simple correlation studies. The paper creates “brain de-aligned” models through adversarial training, with “brain retention” models as controls. This is a well-designed intervention experiment, providing a new tool for studying the function of representations.
2. Using the Holmes benchmark to evaluate over 200 fine-grained linguistic tasks is a major strength. It allows the authors to see how de-alignment affects various aspects of lang

Weaknesses

1. The key findings are only based on bert-base and gpt-small. In 2025, using these models is not enough to study emergent properties like brain alignment in LLMs. Many key language abilities and representations only show up in much larger models (for example, over 7B parameters). Do these findings hold for modern models like Llama or Qwen?
2. On the Harry Potter dataset, the difference in GPT-2 model performance is not statistically significant (p=0.055). On the Moth Radio Hour dataset, the au

Reviewer 02 · Rating 6 · Confidence 5

Strengths

1. Explicitly training models for brain misalignment is novel as far as I am aware. This is a cool method for causally testing the effect of neural data regularization in both directions.
2. Great control: models trained for misalignment are compared with models under the same training pipeline but with permuted stimulus-response pairs.
3. Methods and results are presented clearly, making the paper easy to follow.
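The permutation control mentioned in the strengths above can be illustrated with a short sketch. It assumes stimulus features and brain responses are stored as row-aligned arrays, so that shuffling the response rows breaks the stimulus-response pairing while leaving both sets of data otherwise unchanged. The function name `permute_pairs` and the array layout are hypothetical; the paper's actual pipeline is not shown in this excerpt.

```python
import numpy as np

def permute_pairs(stimulus_features, brain_responses, seed=0):
    """Return the stimuli unchanged and the brain responses in a shuffled
    row order, so each stimulus is paired with a response recorded for a
    different stimulus (hypothetical control, assumed array layout)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(brain_responses))
    return stimulus_features, brain_responses[order]
```

A model trained on the permuted pairs goes through the same training procedure and sees the same response values, but the responses no longer carry information about the matching stimulus, which is what makes it a useful baseline for the misaligned model.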

Weaknesses

### 1. Small scale of tested models

The models under consideration (BERT-base and GPT-small) are very small by modern standards, limiting the applicability of claims to the state of the art. In particular, smaller models might lack capacity to retain high linguistic performance despite reduced brain alignment, whereas larger models' performance might be affected by brain-misalignment less strongly or not at all.

### 2. Small effect size

The average win rate between misaligned model and control

Reviewer 03 · Rating 4 · Confidence 3

Strengths

- The paper is conceptually original in treating brain–LM alignment as an intervenable property rather than a purely correlational observation. Constructing “brain-misaligned” variants via adversarial fine-tuning and contrasting them with a permutation-based control is a creative way to probe causal relevance.
- Methodologically, the study is careful about a manipulation check (reduced brain predictivity in language ROIs) and about holding general LM performance roughly fixed during selection,

Weaknesses

- The framing somewhat overstates what the intervention establishes. Studying the effects of removing alignment does not automatically prove that alignment, per se, is beneficial. Removing brain–LM similarity and observing performance drops suggests that aligned information may correlate with linguistic competence, but it does not directly prove that alignment itself is necessary or causally beneficial. How do we ensure that the suggested manipulation does not act as structured noise or remove s

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Multimodal Machine Learning Applications · EEG and Brain-Computer Interfaces