Neuroscience-Inspired Analyses of Visual Interestingness in Multimodal Transformers
Mathis Immertreu, Fitim Abdullahu, Thomas Kinfe, Helmut Grabner, Patrick Krauss, Achim Schilling

TL;DR
This study investigates how multimodal transformers encode visual interestingness. It finds that human-like interest signals are linearly decodable and emerge progressively across model layers, in a way that parallels principles of neural processing.
Contribution
The paper shows that transformer models encode visual interestingness without explicit supervision, in a structured, layer-wise manner reminiscent of neural processing.
Findings
Common Interestingness (CI) information is linearly decodable from final-layer embeddings.
Intermediate layers encode emergent, distinguishable representations of visual interest.
Higher layers converge on robust concept vectors indicating interest encoding.
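
To make "linearly decodable" concrete, the following is a minimal sketch of a linear probe: a ridge regression fitted on embeddings to predict a scalar score, then evaluated on held-out samples. It is an illustration, not the authors' pipeline. The embeddings and scores are synthetic, and the names `fit_ridge_probe`, `r2_score`, `direction`, and the value of `alpha` are assumptions made for this example; in practice `X` would hold embeddings extracted from the model and `ci` the matching CI scores.

```python
import numpy as np

def fit_ridge_probe(X, y, alpha=1.0):
    """Fit a linear probe y ~ X @ w + b with L2 regularization (closed form)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # center so the bias is not penalized
    d = X.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b

def r2_score(y_true, y_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data (assumption): the score lies along one hidden
# direction of a 32-dimensional embedding space, plus a little noise.
rng = np.random.default_rng(0)
n, d = 600, 32
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
X = rng.normal(size=(n, d))
ci = X @ direction + 0.1 * rng.normal(size=n)

# Fit on the first 400 samples, evaluate on the remaining 200.
train, test = slice(0, 400), slice(400, n)
w, b = fit_ridge_probe(X[train], ci[train], alpha=1.0)
r2 = r2_score(ci[test], X[test] @ w + b)

# The normalized probe weights act as a candidate "concept vector";
# on this toy data they should align with the planted direction.
cosine = float(w @ direction / np.linalg.norm(w))
print(f"held-out R^2: {r2:.3f}, cosine to planted direction: {cosine:.3f}")
```

Repeating the same fit on embeddings taken from each layer, and comparing held-out scores, is one common way to examine how a signal emerges across depth.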
Abstract
Human attention is the gateway to conscious perception, memory and decision-making. However, its role in modern transformer models remains largely unexplored. As these systems increasingly influence what people see, prefer and buy, the question arises as to whether they encode principles of human interest or merely exploit large-scale correlations. Answering this question is crucial for understanding cognition and ensuring the responsible use of AI in communication and marketing. To this end, we examined the concept of visual interest within the multimodal vision-language model Qwen3-VL-8B, using a pre-defined Common Interestingness (CI) score derived from large-scale human engagement data on the photo-sharing platform Flickr. We analyzed internal representations across vision and language components using methods from neuroscience. Our analyses revealed…
