Causes and Consequences of Representational Similarity in Machine Learning Models
Zeyu Michael Li, Hung Anh Vu, Damilola Awofisayo, Emily Wenger

TL;DR
This paper investigates how dataset and task overlaps influence the similarity in representations across machine learning models and shows that higher similarity increases vulnerability to certain attacks.
Contribution
It systematically analyzes two candidate causes of representational similarity, dataset overlap and task overlap, and the effect of that similarity on robustness to transferable attacks, across modalities and model sizes.
Findings
Dataset and task overlap increase representational similarity.
Greater similarity increases vulnerability to transferable adversarial and jailbreak attacks.
Combining dataset and task overlap yields the strongest similarity effects.
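To make "dataset overlap" concrete, the sketch below draws two training subsets that share a chosen fraction of their examples. It is an illustrative assumption, not the authors' splitting procedure: the function name `split_with_overlap` and its parameters are hypothetical, and the paper may define overlap differently.

```python
import numpy as np


def split_with_overlap(n_samples, subset_size, overlap, seed=0):
    """Return two index arrays of length `subset_size` drawn from
    range(n_samples) that share round(overlap * subset_size) indices.

    Hypothetical helper for illustration only; not taken from the paper.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must be in [0, 1]")
    n_shared = int(round(overlap * subset_size))
    n_unique = subset_size - n_shared
    needed = n_shared + 2 * n_unique  # distinct examples required in total
    if needed > n_samples:
        raise ValueError("not enough samples for the requested split")

    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    shared = perm[:n_shared]                        # seen by both models
    only_a = perm[n_shared:n_shared + n_unique]     # seen only by model A
    only_b = perm[n_shared + n_unique:needed]       # seen only by model B
    return np.concatenate([shared, only_a]), np.concatenate([shared, only_b])


# Two 400-example subsets of a 1000-example dataset with 25% overlap.
idx_a, idx_b = split_with_overlap(1000, 400, 0.25)
```

Training one model on `idx_a` and another on `idx_b`, then sweeping `overlap` from 0 to 1, gives one way to hold the task fixed while varying the data the two models share.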
Abstract
Numerous works have noted similarities in how machine learning models represent the world, even across modalities. Although much effort has been devoted to uncovering properties and metrics on which these models align, surprisingly little work has explored causes of this similarity. To advance this line of inquiry, this work explores how two factors - dataset overlap and task overlap - influence downstream model similarity. We evaluate the effects of both factors through experiments across model sizes and modalities, from small classifiers to large language models. We find that both task and dataset overlap cause higher representational similarity and that combining them provides the strongest effect. Finally, we consider downstream consequences of representational similarity, demonstrating how greater similarity increases vulnerability to transferable adversarial and jailbreak attacks.
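The reviews name CKA as one of the similarity measures used. As a minimal sketch of how such a score can be computed, the code below implements the standard linear CKA formula on two activation matrices taken over the same inputs. Whether the paper uses this linear variant, and which layers it compares, is not stated on this page, so treat the function name `linear_cka` and the setup as assumptions.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activations x (n, d1) and y (n, d2),
    where row i of both matrices corresponds to the same input example.
    """
    # Center each feature so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)

    # ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy usage with random "activations" standing in for two models.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(50, 8))
acts_b = rng.normal(size=(50, 16))
score = linear_cka(acts_a, acts_b)
```

The score lies in [0, 1] and is unchanged by orthogonal transformations and isotropic scaling of either representation, which is why it can compare models whose feature dimensions differ.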
Peer Reviews
Decision·Submitted to ICLR 2026
- Novel causal interventions on the causes of model representation alignment. The paper systematically modifies dataset overlap and task overlap, and measures how these correlate with CKA and other alignment metrics for both ViTs and small LLMs.
- Isolates task vs. data overlap in representation similarities. Performs experiments on a number of different measures (CKA, nearest-neighbor scores, mutual KNN).
- Concerns about generality. In general, the results on language modeling seem to be in tension with the stated takeaways of the paper, and the authors do not properly explain this discrepancy. Namely, they note that for the Llama fine-tuning experiments, the models in general have low correlation between their CKA scores and dataset overlap. This is comparatively true for task overlap as well. This lower correlation is hypothesized to be due to the fact that the LLMs were fine-tuned, so a lower
* The paper studies an interesting question that can have large implications for building and evaluating foundation models.
* The paper is well written and the experiments and results are described clearly.
* The authors make an effort to make the underlying factors configurable and do a good job of holding one property constant (e.g. data overlap) while varying the other (e.g. task overlap).
* The different tasks that are studied for ColorShapeDigit800K are all simple character/digit/color classification tasks, while the differences between SSL objectives (multi-view, masking), image/text models, supervised models, and text models seem larger in my view. I’m not sure if it is possible to approximate this with these simple tasks.
* The results that more task and data overlap increase representational similarity feel a bit trivial. Although it is nice that the authors confirm
- Clearly motivated work addressing an important gap: trying to understand *causes* rather than just observing representational similarity
- Well-designed methodology to isolate dataset and task overlap effects through controlled splitting
- Broad empirical evaluation across multiple modalities (vision, language, diffusion)
- Creative ColorShapeDigit800K dataset enabling task manipulation without dataset variation
- Novel connection to downstream security implications (adversarial transferabilit
- Language model results are inconclusive and undermine the paper's central claims of dataset and task overlap strongly influencing the representational similarity:
  - Figure 19 (Llama fine-tuning): Only small positive correlations appear with CKA for task splitting; data splitting and local similarity measures show no effect from increasing overlap. Authors attribute this to LoRA without further hypothesis. Can the authors hypothesize beyond "LoRA limitations"? Is this a fundamental differen
Taxonomy
Topics: Statistical and Computational Modeling · Neural Networks and Applications
Methods: Sparse Evolutionary Training
