Exploring Compositionality in Vision Transformers using Wavelet Representations
Akshad Shyam Purushottamdas, Pranav K Nayak, Divya Mehul Rajparia, Deekshith Patel, Yashmitha Gogineni, Konda Reddy Mopuri, Sumohana S. Channappayya

TL;DR
This paper investigates the compositionality of Vision Transformer (ViT) representations using wavelet transforms, revealing that ViT encodings approximately compose from wavelet primitives, providing new insights into their information structure.
Contribution
Introduces a framework that uses the Discrete Wavelet Transform (DWT) to empirically test compositionality in ViT representations, bridging vision and language analysis methods.
Findings
Wavelet primitives can approximately reconstruct original representations.
ViT encodings exhibit compositional structure in latent space.
Provides a new perspective on how ViTs organize visual information.
Abstract
While insights into the workings of the transformer model have largely emerged by analysing their behaviour on language tasks, this work investigates the representations learnt by the Vision Transformer (ViT) encoder through the lens of compositionality. We introduce a framework, analogous to prior work on measuring compositionality in representation learning, to test for compositionality in the ViT encoder. Crucial to drawing this analogy is the Discrete Wavelet Transform (DWT), which is a simple yet effective tool for obtaining input-dependent primitives in the vision setting. By examining the ability of composed representations to reproduce original image representations, we empirically test the extent to which compositionality is respected in the representation space. Our findings show that primitives from a one-level DWT decomposition produce encoder representations that…
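As a rough illustration of the setup the abstract describes, the sketch below builds four image-space primitives from a one-level Haar DWT and compares a composed representation against the encoding of the original image. It is a minimal sketch under stated assumptions, not the authors' implementation: the Haar wavelet, the choice to form each primitive by inverting a single sub-band, the additive composition, and the `toy_encoder` function (a random linear map with a `tanh`, standing in for a ViT encoder) are all hypothetical choices made for illustration.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2D Haar DWT: returns one approximation and three detail sub-bands."""
    a, b = img[0::2, 0::2], img[0::2, 1::2]
    c, d = img[1::2, 0::2], img[1::2, 1::2]
    approx = (a + b + c + d) / 2.0
    det1 = (a + b - c - d) / 2.0
    det2 = (a - b + c - d) / 2.0
    det3 = (a - b - c + d) / 2.0
    return approx, det1, det2, det3

def haar_idwt2(approx, det1, det2, det3):
    """Inverse of haar_dwt2: rebuilds the full-resolution image."""
    h, w = approx.shape
    out = np.empty((2 * h, 2 * w))
    out[0::2, 0::2] = (approx + det1 + det2 + det3) / 2.0
    out[0::2, 1::2] = (approx + det1 - det2 - det3) / 2.0
    out[1::2, 0::2] = (approx - det1 + det2 - det3) / 2.0
    out[1::2, 1::2] = (approx - det1 - det2 + det3) / 2.0
    return out

def wavelet_primitives(img):
    """Assumed primitive construction: invert each sub-band with the others zeroed."""
    bands = haar_dwt2(img)
    prims = []
    for k in range(4):
        kept = [b if i == k else np.zeros_like(b) for i, b in enumerate(bands)]
        prims.append(haar_idwt2(*kept))
    return prims

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 64)) / 8.0  # hypothetical encoder weights

def toy_encoder(img):
    """Hypothetical stand-in for a ViT encoder: random linear map plus tanh."""
    return np.tanh(W @ img.reshape(-1))

img = rng.normal(size=(8, 8))          # toy "image"
prims = wavelet_primitives(img)        # four image-space primitives

z_original = toy_encoder(img)
z_composed = sum(toy_encoder(p) for p in prims)  # additive composition (assumed)

cos_sim = float(z_composed @ z_original /
                (np.linalg.norm(z_composed) * np.linalg.norm(z_original)))
print("cosine similarity between composed and original encoding:", cos_sim)
```

Because the DWT is linear, the four primitives sum exactly to the input image; any gap between `z_composed` and `z_original` therefore comes from the nonlinearity of the encoder, which is the quantity a compositionality test of this kind probes.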
Peer Reviews
Decision: Submitted to ICLR 2025
* The paper presents a compelling idea — that DWT components could serve as primitives through which to study compositionality in ViTs.
Weaknesses: **Clarity**: The paper can be unclear at times about the specifics of what was done or how; I think that further elaboration from the authors throughout could help strengthen the paper.
* Figure 1: What does it mean here for the original image’s representation to be compared to the composed image representation? Does it mean that the maps we see in the figure are the result of the comparison, or are these just the composed representations and the comparison was done outside of the fi
* The paper is generally well-written.
* It builds on the well-established framework by Andreas (2019) for compositional representations.
* Wavelets are a common and natural basis for image representations in signal processing.
* My main concern is that the presented results do not convincingly demonstrate compositionality. Rather than defining true combinations of wavelet primitive representations, it appears that the learned weights mainly select the low-pass filtered image (Table 3). Indeed, it is not particularly surprising that the images in Figure 5 perform similarly to the original images.
* Compositionality is typically more valuable when components are semantic rather than appearance-based. It is doubtful tha
1. The paper investigates a novel approach to a significant problem. Importantly, the authors show that learning a composition is significantly better than composition by addition.
2. The authors investigate the learning of composition to some extent, trying variants such as conic, convex and unconstrained.
1a. I believe the authors' experimental setup does not use sufficient data. For reference, papers such as ViT-NeT, which explores the interpretability of ViTs, use three datasets, each of which has 10-20k images in total. I'd suggest the authors scale up their datasets: if we choose to use ImageNet-1k, a test set of at least 10 images per class would be more convincing. Currently, with a 15% test fraction, each class gets 1-2 images. I hesitate to draw conclusions based on such a small test set per class. 1b
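The unconstrained, conic and convex variants mentioned in the reviews can be read as different constraints on the weights of a linear combination of primitive representations: any real weights, non-negative weights, and non-negative weights that sum to one, respectively. The sketch below fits such weights by projected gradient descent on synthetic data. It is an assumption-laden illustration only: the least-squares objective, the optimiser, the function names `fit_weights` and `project_simplex`, and the synthetic matrix `P` are hypothetical and are not taken from the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def fit_weights(P, z, mode="unconstrained", steps=2000):
    """Minimise 0.5 * ||P w - z||^2 over w, subject to the chosen constraint."""
    n = P.shape[1]
    w = np.full(n, 1.0 / n)
    lr = 1.0 / np.linalg.norm(P, 2) ** 2  # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        w = w - lr * (P.T @ (P @ w - z))
        if mode == "conic":        # non-negative weights
            w = np.maximum(w, 0.0)
        elif mode == "convex":     # non-negative weights summing to one
            w = project_simplex(w)
    return w

rng = np.random.default_rng(0)
P = rng.normal(size=(32, 4))               # columns: synthetic primitive representations
w_true = np.array([0.9, 0.1, -0.2, 0.0])   # synthetic ground-truth weights
z = P @ w_true                             # synthetic target representation

w_unc = fit_weights(P, z, "unconstrained")
w_con = fit_weights(P, z, "conic")
w_cvx = fit_weights(P, z, "convex")

def err(w):
    """Reconstruction error of the composed representation."""
    return float(np.linalg.norm(P @ w - z))

print("unconstrained:", w_unc, err(w_unc))
print("conic:        ", w_con, err(w_con))
print("convex:       ", w_cvx, err(w_cvx))
```

Since the feasible sets are nested (convex inside conic inside unconstrained), the fitted reconstruction error can only grow as the constraint tightens, which gives one way to compare the variants on equal footing.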
Taxonomy
Topics: Neurobiology of Language and Bilingualism · Action Observation and Synchronization · Face Recognition and Perception
