A Comparative Analysis of Layer-wise Representational Capacity in AR and Diffusion LLMs
Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli

TL;DR
This paper compares internal representations of autoregressive and diffusion language models, revealing how diffusion objectives induce redundancy that can be exploited for model compression without significant performance loss.
Contribution
It provides the first layer-wise and token-wise analysis comparing AR and diffusion LLMs, highlighting how diffusion objectives create redundancy enabling efficient model compression.
Findings
Diffusion models have more global representations, with substantial redundancy in early layers.
AR models have tightly coupled, locally structured representations.
Diffusion-trained models retain performance, without significant loss, under layer skipping that reduces FLOPs by up to 18.75%.
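As a rough aid to reading the 18.75% figure, the sketch below computes the fraction of layer compute removed when whole layers are skipped. It assumes every transformer layer costs the same and ignores embedding and output-head cost; the function name and the 6-of-32 example are illustrative assumptions, not numbers taken from the paper.

```python
from fractions import Fraction


def skipped_flops_fraction(skipped_layers: int, total_layers: int) -> Fraction:
    """Fraction of per-layer compute removed by skipping whole layers.

    Assumption (not from the paper): all layers have equal cost and
    non-layer compute (embeddings, output head) is ignored.
    """
    if total_layers <= 0 or not 0 <= skipped_layers <= total_layers:
        raise ValueError("need 0 <= skipped_layers <= total_layers and total_layers > 0")
    return Fraction(skipped_layers, total_layers)


# Hypothetical configuration: skipping 6 layers of a 32-layer model.
fraction = skipped_flops_fraction(6, 32)
print(f"{float(fraction):.2%}")
```

Under these assumptions, any ratio equal to 3/16 of the layers gives the same reduction; the summary does not state which layers or how many were skipped.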
Abstract
Autoregressive (AR) language models build representations incrementally via left-to-right prediction, while diffusion language models (dLLMs) are trained through full-sequence denoising. Although recent dLLMs match AR performance, whether diffusion objectives fundamentally reshape internal representations remains unclear. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B), using cosine similarity across layers and tokens alongside static inference-time layer-skipping as an analytical probe of redundancy. We find that diffusion objectives produce more global representations with substantial early-layer redundancy and reduced recency bias, while AR objectives yield tightly coupled, locally structured representations. AR-initialized dLLMs retain AR-like dynamics despite…
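The abstract names cosine similarity across layers as the main probe. A minimal sketch of one such measurement follows, assuming hidden states have already been extracted into an array of shape (num_layers, num_tokens, dim). The function names, the adjacent-layer pairing, and the 0.95 threshold are hypothetical choices for illustration; the paper's exact procedure is not given in this summary.

```python
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Cosine similarity along the last axis."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / np.maximum(den, eps)


def adjacent_layer_similarity(hidden: np.ndarray) -> np.ndarray:
    """Mean over tokens of cos(h_l, h_{l+1}) for each adjacent layer pair.

    hidden: array of shape (num_layers, num_tokens, dim).
    Returns an array of shape (num_layers - 1,).
    """
    return cosine(hidden[:-1], hidden[1:]).mean(axis=-1)


def skip_candidates(hidden: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Indices of layers whose output barely differs in direction from their input.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    sims = adjacent_layer_similarity(hidden)
    return [l + 1 for l, s in enumerate(sims) if s >= threshold]


# Toy example: layer 1 only rescales layer 0; layer 2 rotates every token.
layer0 = np.array([[1.0, 0.0], [0.0, 1.0]])
hidden = np.stack([layer0, 2.0 * layer0, layer0[::-1]])
print(adjacent_layer_similarity(hidden))
print(skip_candidates(hidden))
```

High similarity between consecutive layers suggests a layer changes the representation little, which is the kind of redundancy a static layer-skipping probe would then test against task performance.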
