On the dynamic evolution of CLIP texture-shape bias and its relationship to human alignment and model robustness
Pablo Hernández-Cámara, Jose Manuel Jaén-Lorites, Alexandra Gómez-Villa, Jorge Vila-Tomás, Valero Laparra, Jesus Malo

TL;DR
This paper analyzes how CLIP's internal representations evolve during training, revealing a trade-off between texture bias, perceptual alignment with humans, and robustness to noise, across different model scales.
Contribution
It provides the first epoch-by-epoch analysis of CLIP's representational dynamics, linking texture-shape bias, human perceptual alignment, and robustness during training.
Findings
Early training shows strong texture bias and high low-level perceptual alignment.
As training progresses, texture bias decreases and shape-based representations emerge.
Robustness to noise improves as texture bias diminishes.
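To make "texture bias" concrete, here is a minimal sketch of one common way to quantify it. It assumes a cue-conflict setup in which each image has a shape from one class and a texture from another, and shape bias is the fraction of shape decisions among the decisions that match either cue. The summary does not state the paper's exact metric, so the function name `shape_bias`, the tuple layout, and the example labels are all hypothetical.

```python
from typing import Iterable, Tuple


def shape_bias(decisions: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of shape decisions among shape-or-texture decisions.

    Each item is (predicted_label, shape_label, texture_label) for one
    cue-conflict image. This tuple layout is an assumption for illustration.
    """
    shape_hits = 0
    texture_hits = 0
    for predicted, shape_label, texture_label in decisions:
        if shape_label == texture_label:
            continue  # not a cue-conflict image, so it carries no signal
        if predicted == shape_label:
            shape_hits += 1
        elif predicted == texture_label:
            texture_hits += 1
        # predictions matching neither cue are ignored
    total = shape_hits + texture_hits
    if total == 0:
        raise ValueError("no prediction matched either the shape or texture label")
    return shape_hits / total


# Hypothetical predictions from one training checkpoint.
example = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("elephant", "dog", "elephant"),  # texture decision
    ("car", "dog", "clock"),          # neither cue, ignored
]
bias = shape_bias(example)
texture_bias = 1.0 - bias
```

Evaluating such a score at every saved checkpoint would trace a curve over training; under this definition, the findings above correspond to `texture_bias` starting high and falling as training proceeds.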
Abstract
Contrastive language-image models such as CLIP have demonstrated remarkable generalization capabilities. However, how their internal visual representations evolve during training and how this evolution relates to human perception remain poorly understood. Most existing analyses characterize fully trained models, leaving the dynamics of representational biases and perceptual alignment largely unexplored. In this work, we present an epoch-by-epoch analysis of CLIP models throughout training, focusing on the evolution of texture-shape bias, alignment with human perceptual judgements, and sensitivity to image noise. Using multiple perceptual benchmarks spanning low-level image quality assessment, mid-level perceptual similarity, saliency correspondence, and noise robustness, we identify a consistent, training-stage-dependent representational transition. Early training stages exhibit strong…
