Sparse Visual Thought Circuits in Vision-Language Models
Yunpeng Zhou

TL;DR
This paper investigates whether sparse autoencoder (SAE) features in vision-language models form modular units for reasoning and finds that they often do not, with implications for interpretability and feature-based control (steering) methods.
Contribution
It introduces a causal pipeline to localize and test sparse visual thought circuits, revealing non-modular interactions and providing a diagnostic framework for VLM interpretability.
Findings
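Intervening on the union of two task-selective feature sets causes output drift and degrades accuracy, even when perturbations are norm-matched.
The interference is consistent with SAE features sharing internal pathways, which undermines the modularity assumption.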
The framework is validated across multiple datasets and model families.
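To make "output drift" and "accuracy degradation" concrete, here is a minimal sketch that compares predictions before and after an intervention. It assumes drift is measured as the fraction of examples whose predicted answer changes; the abstract describes drift only as "large unintended changes in predictions", so this exact definition, the function names `output_drift` and `accuracy_delta`, and the toy predictions are all hypothetical and are not results from the paper.

```python
import numpy as np

def output_drift(base_preds, steered_preds):
    """Fraction of examples whose predicted answer changes under intervention.
    Hypothetical metric; the paper's exact drift definition is not given here."""
    base = np.asarray(base_preds)
    steered = np.asarray(steered_preds)
    return float(np.mean(base != steered))

def accuracy_delta(base_preds, steered_preds, labels):
    """Accuracy after intervention minus accuracy before it."""
    labels = np.asarray(labels)
    base_acc = np.mean(np.asarray(base_preds) == labels)
    steered_acc = np.mean(np.asarray(steered_preds) == labels)
    return float(steered_acc - base_acc)

# Toy, made-up predictions for five examples (illustration only).
labels = [0, 1, 2, 1, 0]
base_preds = [0, 1, 2, 0, 0]    # unsteered model: 4 of 5 correct
single_preds = [0, 1, 2, 1, 0]  # steer one feature set: 5 of 5 correct
union_preds = [2, 1, 0, 0, 1]   # steer the union of two sets: 1 of 5 correct

print(output_drift(base_preds, single_preds), accuracy_delta(base_preds, single_preds, labels))
print(output_drift(base_preds, union_preds), accuracy_delta(base_preds, union_preds, labels))
```

In this toy setup the single-set intervention changes one prediction and helps, while the union changes three and hurts, mirroring the qualitative pattern the findings describe; the numbers themselves carry no empirical weight.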
Abstract
Sparse autoencoders (SAEs) improve interpretability in multimodal models, but it remains unclear whether SAE features form modular, composable units for reasoning, an assumption underlying many intervention-based steering methods. We test this modularity hypothesis and find that it often fails: intervening on a task-selective feature set can modestly improve reasoning accuracy, while intervening on the union of two such sets reliably induces output drift (large unintended changes in predictions) and degrades accuracy, even under norm-matched perturbations. This non-modular circuit interference is consistent with shared internal pathways in which feature unions amplify activation shifts. We develop a reproducible causal pipeline to localize and test these sparse visual thought circuits in Qwen3-VL-8B. On a controlled synthetic benchmark with seven task types and three difficulty levels, linear…
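The norm-matched comparison in the abstract can be sketched as follows, assuming an additive intervention that adds a scaled sum of SAE decoder directions to an activation at one layer. The abstract does not specify the paper's actual intervention, so this is an illustration under that assumption: `W_dec`, `steering_delta`, `norm_match`, the feature indices, and the strength `alpha` are hypothetical, and random unit vectors stand in for a real SAE trained on Qwen3-VL-8B.

```python
import numpy as np

def steering_delta(W_dec, feature_ids, alpha=1.0):
    """Additive intervention: alpha times the sum of the decoder rows
    (feature directions) for the selected SAE features."""
    return alpha * W_dec[sorted(feature_ids)].sum(axis=0)

def norm_match(delta, target_norm, eps=1e-12):
    """Rescale delta so its L2 norm equals target_norm, keeping its direction."""
    return delta * (target_norm / (np.linalg.norm(delta) + eps))

rng = np.random.default_rng(0)
n_features, d_model = 64, 16
W_dec = rng.normal(size=(n_features, d_model))
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)  # unit-norm feature directions

set_a = {3, 7, 12}       # hypothetical task-selective feature set
set_b = {7, 20, 41}      # a second set; feature 7 is shared
union = set_a | set_b    # 5 distinct features

delta_a = steering_delta(W_dec, set_a, alpha=2.0)
raw_union = steering_delta(W_dec, union, alpha=2.0)
# Norm-matched union: same magnitude as the single-set intervention, different direction.
delta_union = norm_match(raw_union, np.linalg.norm(delta_a))
# Control: a random direction with the same norm.
control = norm_match(rng.normal(size=d_model), np.linalg.norm(delta_a))

h = rng.normal(size=d_model)  # stand-in for an activation at the hooked layer
h_a, h_union, h_ctrl = h + delta_a, h + delta_union, h + control

print(np.linalg.norm(delta_a), np.linalg.norm(raw_union), np.linalg.norm(delta_union))
```

Summing more directions typically gives the raw union a larger norm than either set alone, so norm matching separates the effect of the union's direction from that of its magnitude; the random control checks whether any same-size perturbation would cause comparable drift.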
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Domain Adaptation and Few-Shot Learning
