
TL;DR
This paper investigates how different vision language models encode political and cultural categories in their latent spaces, revealing model-specific sensitivities and biases shaped by training data and architecture, with implications for digital art history.
Contribution
It introduces the concepts of computational latent politicization, emergent bias, and distinct algorithmic scopic regimes, providing a critical framework for understanding model-specific perceptual sensitivities.
Findings
SigLIP classifies 59.4% of the artworks as politically engaged, compared with only 4% for OpenCLIP.
African masks score as highly political in SigLIP but as apolitical in OpenAI CLIP.
Between-model discrepancies on the aesthetic colonial axes reach 72.6 percentage points.
Abstract
This study challenges the presumed neutrality of latent spaces in vision language models (VLMs) by adopting an ethological perspective on their algorithmic behaviors. Rather than constituting spaces of homogeneous indeterminacy, latent spaces exhibit model-specific algorithmic sensitivities, understood as differential regimes of perceptual salience shaped by training data and architectural choices. Through a comparative analysis of three models (OpenAI CLIP, OpenCLIP LAION, SigLIP) applied to a corpus of 301 artworks (15th to 20th centuries), we reveal substantial divergences in the attribution of political and cultural categories. Using bipolar semantic axes derived from vector analogies (Mikolov et al., 2013), we show that SigLIP classifies 59.4% of the artworks as politically engaged, compared to only 4% for OpenCLIP. African masks receive the highest political scores in SigLIP while…
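To make the idea of a bipolar semantic axis concrete, here is a minimal sketch of one way such an axis could be built and applied. It is not the authors' pipeline. The helper names (`unit`, `bipolar_axis`, `axis_scores`, `share_positive`), the toy 3-dimensional vectors, and the zero decision threshold are all assumptions made for illustration. In practice, the pole vectors would come from a given model's text encoder and the image vectors from that same model's image encoder.

```python
import numpy as np

def unit(v):
    """L2-normalize vectors along the last dimension."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def bipolar_axis(pos_text_emb, neg_text_emb):
    """Axis = mean(positive pole) - mean(negative pole), then normalized.

    This mirrors the vector-offset idea behind word analogies;
    the exact construction used in the paper is not specified here.
    """
    pos = unit(pos_text_emb).mean(axis=0)
    neg = unit(neg_text_emb).mean(axis=0)
    return unit(pos - neg)

def axis_scores(image_emb, axis):
    """Project normalized image embeddings onto the axis (cosine-like score)."""
    return unit(image_emb) @ axis

def share_positive(scores, threshold=0.0):
    """Fraction of items on the positive side (threshold is an assumption)."""
    return float((np.asarray(scores) > threshold).mean())

# Toy stand-ins for text embeddings of the two poles (hypothetical values).
political_pole = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
apolitical_pole = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]

# Toy stand-ins for four artwork image embeddings (hypothetical values).
artworks = [
    [1.0, 0.2, 0.3],
    [0.1, 1.0, 0.0],
    [0.4, 0.6, 1.0],
    [0.8, 0.1, 0.1],
]

axis = bipolar_axis(political_pole, apolitical_pole)
scores = axis_scores(artworks, axis)
share = share_positive(scores)
print(scores, share)
```

Repeating the same procedure with each model's own encoders would give one share per model, which is the kind of quantity the abstract compares across models. The result depends on the pole prompts and the threshold, so both would need to be fixed in advance for a fair comparison.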
Taxonomy
Topics: Language, Metaphor, and Cognition · Embodied and Extended Cognition · Language and cultural evolution
