Exploring Compositionality in Vision Transformers using Wavelet Representations

Akshad Shyam Purushottamdas; Pranav K Nayak; Divya Mehul Rajparia; Deekshith Patel; Yashmitha Gogineni; Konda Reddy Mopuri; Sumohana S. Channappayya

arXiv:2512.24438 · cs.CV · January 1, 2026

TL;DR

This paper investigates the compositionality of Vision Transformer (ViT) representations using wavelet transforms, revealing that ViT encodings approximately compose from wavelet primitives, providing new insights into their information structure.

Contribution

Introduces a framework using Discrete Wavelet Transform to empirically test compositionality in ViT representations, bridging vision and language analysis methods.
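To make the idea of DWT primitives concrete, here is a minimal sketch of a one-level 2D decomposition in NumPy. It assumes a Haar wavelet and a single-channel image with even side lengths. It also assumes one particular way of turning sub-bands into image-sized primitives: inverting the transform with all but one sub-band zeroed. The page does not state the wavelet or the construction the authors use, so the function names (`haar_dwt2`, `haar_idwt2`, `subband_primitives`) and these choices are illustrative only.

```python
import numpy as np

SQRT2 = np.sqrt(2.0)


def haar_dwt2(image):
    """One-level 2D Haar DWT of a 2D array with even side lengths.

    Returns four half-resolution sub-bands (ll, lh, hl, hh).
    Sub-band naming conventions differ between libraries.
    """
    x = np.asarray(image, dtype=np.float64)
    if x.ndim != 2 or x.shape[0] % 2 or x.shape[1] % 2:
        raise ValueError("expected a 2D array with even side lengths")
    # Filter along each row: combine pairs of adjacent columns.
    lo = (x[:, 0::2] + x[:, 1::2]) / SQRT2
    hi = (x[:, 0::2] - x[:, 1::2]) / SQRT2
    # Filter along each column: combine pairs of adjacent rows.
    ll = (lo[0::2] + lo[1::2]) / SQRT2
    lh = (lo[0::2] - lo[1::2]) / SQRT2
    hl = (hi[0::2] + hi[1::2]) / SQRT2
    hh = (hi[0::2] - hi[1::2]) / SQRT2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuild the full-resolution array."""
    rows, cols = ll.shape
    lo = np.empty((rows * 2, cols))
    hi = np.empty((rows * 2, cols))
    lo[0::2], lo[1::2] = (ll + lh) / SQRT2, (ll - lh) / SQRT2
    hi[0::2], hi[1::2] = (hl + hh) / SQRT2, (hl - hh) / SQRT2
    x = np.empty((rows * 2, cols * 2))
    x[:, 0::2], x[:, 1::2] = (lo + hi) / SQRT2, (lo - hi) / SQRT2
    return x


def subband_primitives(image):
    """Four image-sized primitives, each keeping exactly one sub-band.

    Assumed construction: zero the other three sub-bands and invert.
    """
    bands = haar_dwt2(image)
    zeros = np.zeros_like(bands[0])
    return [
        haar_idwt2(*[b if j == i else zeros for j, b in enumerate(bands)])
        for i in range(4)
    ]


rng = np.random.default_rng(0)
image = rng.random((8, 8))
primitives = subband_primitives(image)
```

Because the transform is linear, the four primitives sum back to the input image exactly. That is why composition by addition holds trivially in pixel space, and why the interesting question is whether something similar holds after the encoder.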

Findings

01

Wavelet primitives can approximately reconstruct original representations.

02

ViT encodings exhibit compositional structure in latent space.

03

Provides a new perspective on how ViTs organize visual information.

Abstract

While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that…
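The test described in the abstract, whether composed representations reproduce the original image representation, can be sketched as follows. This is not the authors' implementation. The encoder is a hypothetical stand-in (a fixed random projection followed by `tanh`), the four "primitives" are random vectors that sum to the input rather than real DWT sub-bands, and the composition is an unconstrained weighted sum with weights shared across samples and fitted by least squares. The names `fit_shared_weights`, `compose`, `mean_cosine`, and `toy_encoder` are all assumptions made for illustration.

```python
import numpy as np


def fit_shared_weights(prim_reps, target_reps):
    """Least-squares weights w (one per primitive, shared across samples)
    minimising || sum_k w[k] * prim_reps[:, k, :] - target_reps ||.

    prim_reps: (n, k, d) primitive representations.
    target_reps: (n, d) representations of the original inputs.
    """
    n, k, d = prim_reps.shape
    design = prim_reps.transpose(0, 2, 1).reshape(n * d, k)
    target = target_reps.reshape(n * d)
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights


def compose(prim_reps, weights):
    """Weighted sum of primitive representations, shape (n, d)."""
    return np.einsum("nkd,k->nd", prim_reps, weights)


def mean_cosine(a, b):
    """Mean row-wise cosine similarity between two (n, d) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return float(np.mean(num / den))


rng = np.random.default_rng(0)
n, k, d_in, d_out = 32, 4, 64, 16

# Hypothetical stand-in for a ViT encoder: a fixed nonlinear map.
projection = rng.normal(size=(d_in, d_out)) / np.sqrt(d_in)


def toy_encoder(x):
    return np.tanh(x @ projection)


# Stand-in primitives: k parts per sample that sum to the input.
parts = rng.normal(size=(n, k, d_in))
inputs = parts.sum(axis=1)

prim_reps = toy_encoder(parts)      # (n, k, d_out)
target_reps = toy_encoder(inputs)   # (n, d_out)

weights = fit_shared_weights(prim_reps, target_reps)
learned = compose(prim_reps, weights)
added = compose(prim_reps, np.ones(k))

err_learned = np.linalg.norm(learned - target_reps)
err_added = np.linalg.norm(added - target_reps)
sim_learned = mean_cosine(learned, target_reps)
```

On the data used for fitting, the learned weights can never do worse than plain addition in squared error, since all-ones weights are one of the candidates the least-squares fit considers. A faithful evaluation would fit on a training split and report the error or similarity on held-out images.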

Peer Reviews

Decision · Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

* The paper presents a compelling idea — that DWT components could serve as primitives through which to study compositionality in ViTs.

Weaknesses

**Clarity**: The paper is at times unclear about the specifics of what was done and how; further elaboration from the authors throughout could help strengthen the paper.

* Figure 1: What does it mean here for the original image’s representation to be compared to the composed image representation? Does it mean that the maps we see in the figure are the result of the comparison, or are these just the composed representations and the comparison was done outside of the fi

Reviewer 02 · Rating 5 · Confidence 3

Strengths

* The paper is generally well-written.
* It builds on the well-established framework by Andreas (2019) for compositional representations.
* Wavelets are a common and natural basis for image representations in signal processing.

Weaknesses

* My main concern is that the presented results do not convincingly demonstrate compositionality. Rather than defining true combinations of wavelet primitive representations, it appears that the learned weights mainly select the low-pass filtered image (Table 3). Indeed, it is not particularly surprising that the images in Figure 5 perform similarly to the original images.
* Compositionality is typically more valuable when components are semantic rather than appearance-based. It is doubtful tha

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The paper investigates a novel approach to a significant problem. Importantly, the authors show that learning a composition is significantly better than composition by addition.
2. The authors investigate the learning of composition to some extent, trying variants such as conic, convex and unconstrained.

Weaknesses

1a. I believe the authors' experimental setup does not use sufficient data. For reference, papers such as ViT-NeT, which explores the interpretability of ViTs, use three datasets, each of which has 10-20k images in total. I'd suggest the authors scale up their datasets: if we choose to use ImageNet-1k, a test set of at least 10 images per class would be more convincing. Currently, with a 15% test fraction, each class gets 1-2 images. I hesitate to draw conclusions based on such a small test set per class. 1b

Taxonomy

Topics: Neurobiology of Language and Bilingualism · Action Observation and Synchronization · Face Recognition and Perception