Modal Aphasia: Can Unified Multimodal Models Describe Images From Memory?

Michael Aerni; Joshua Swanson; Kristina Nikolić; Florian Tramèr

arXiv:2510.21842·cs.CV·February 17, 2026

3 Reviews

TL;DR

This paper identifies modal aphasia in unified multimodal models: the models memorize visual concepts accurately but struggle to articulate them in text, a fundamental limitation with safety implications.

Contribution

The study systematically demonstrates modal aphasia as an inherent property of current multimodal models through experiments on real and synthetic data.

Findings

01

Frontier models can reproduce iconic movie artwork almost perfectly but confuse crucial details when asked to describe it in text.

02

Modal aphasia is a fundamental property, not just a training artifact.

03

Safety risks arise because safeguards applied to one modality may leave harmful concepts accessible in another.

Abstract

We present modal aphasia, a systematic dissociation in which current unified multimodal models accurately memorize concepts visually but fail to articulate them in writing, despite being trained on images and text simultaneously. For one, we show that leading frontier models can generate near-perfect reproductions of iconic movie artwork, but confuse crucial details when asked for textual descriptions. We corroborate those findings through controlled experiments on synthetic datasets in multiple architectures. Our experiments confirm that modal aphasia reliably emerges as a fundamental property of current unified multimodal models, not just as a training artifact. In practice, modal aphasia can introduce vulnerabilities in AI safety frameworks, as safeguards applied to one modality may leave harmful concepts accessible in other modalities. We demonstrate this risk by showing how a model…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 3

Strengths

This paper is well written. It starts with a clear motivation drawn from our daily interactions with commercial multimodal models. The authors then compare visual generation with verbal description of movie posters in frontier models, showing the prevalence of modal aphasia. Beyond observational experiments, the authors design crisp synthetic data and fine-tuning experiments to show that modal aphasia can stem from more than just naive image memorization, b

Weaknesses

1. The controlled experiments on open-source models only examine two open-source image generators. Also, the scale of test data is relatively small (below 200).
2. In the controlled fine-tuning for tracing the origin of modal aphasia, activating only the LLM backbone while freezing other components may not reflect real-world training.

Reviewer 02·Rating 8·Confidence 4

Strengths

S1: This paper is clearly written and easy to follow.
S2: The found problem of modality imbalance in image/text generation fidelity is of great importance, and the naming is fun and accurate.
S3: The experiments to quantify and validate modal aphasia are well designed.

Weaknesses

W1: This work focuses on the modality imbalance problem of MLLMs in image/text generation. There are related studies/benchmarks in image/text understanding about modality imbalance in VLMs, which are worth discussing to better position this work.
W2: Since GPT5 is a proprietary model, there are rumors that its image generation is routed through another “sub-model” of GPT5. If so, the modality imbalance problem in image/text generation is kind of expected because of such a mismatch. I would like

Reviewer 03·Rating 6·Confidence 2

Strengths

* The paper introduces the new concept of modal aphasia, identifying a systematic dissociation between visual and textual understanding in unified multimodal models. This is an original and theoretically significant contribution that reframes existing assumptions about cross-modal knowledge transfer.
* The authors demonstrate modal aphasia not only in frontier models such as ChatGPT-5, but also in controlled experiments with open-weight models (Janus-Pro, Harmon). This dual-level validation str

Weaknesses

* The paper successfully identifies modal aphasia as a systematic failure of multimodal models, but it does not offer a clear theoretical explanation or model-level mechanism to account for this behavior. The contribution remains largely descriptive rather than explanatory.
* Most experiments are conducted on controlled synthetic datasets such as fictional faces and geometric patterns. While these setups enable variable control, they do not demonstrate whether modal aphasia persists in realisti
