TL;DR
ELT introduces a parameter-efficient recurrent transformer architecture with iterative weight sharing and intra-loop self distillation, enabling high-quality image and video generation with dynamic trade-offs between computational cost and generation quality.
Contribution
The paper proposes Elastic Looped Transformers (ELT), a recurrent transformer model that combines weight-shared blocks with Intra-Loop Self Distillation (ILSD) to achieve parameter-efficient visual synthesis with flexible inference-time compute.
Findings
Achieves 4x parameter reduction with comparable FID of 2.0 on ImageNet 256x256.
Introduces intra-loop self distillation for effective training of shared-parameter models.
Enables any-time inference with dynamic quality-computation trade-offs.
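To make the parameter-reduction claim concrete, the arithmetic can be sketched as follows. This is a rough illustration, not the paper's configuration: the model width, the depth of 24, and the split into 6 unique blocks looped 4 times are hypothetical values chosen so the ratio comes out to 4x, and the per-block count ignores biases, norms, and embeddings.

```python
def transformer_block_params(d_model: int, mlp_ratio: int = 4) -> int:
    """Approximate weight count of one standard transformer block.

    Biases, layer norms, and embeddings are ignored (an assumption
    made here to keep the arithmetic simple).
    """
    attention = 4 * d_model * d_model        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    return attention + mlp


def stacked_params(d_model: int, depth: int) -> int:
    """Conventional stack: every layer has its own weights."""
    return depth * transformer_block_params(d_model)


def looped_params(d_model: int, unique_blocks: int) -> int:
    """Looped model: extra loops reuse weights, so only unique blocks count."""
    return unique_blocks * transformer_block_params(d_model)


# Hypothetical sizes, not taken from the paper.
d_model = 1024
baseline = stacked_params(d_model, depth=24)
looped = looped_params(d_model, unique_blocks=6)  # 6 blocks x 4 loops = 24 applied layers

print(f"baseline: {baseline:,} weights")
print(f"looped:   {looped:,} weights")
print(f"reduction: {baseline / looped:.1f}x")
```

The applied depth (and so the compute per forward pass) is the same in both cases; only the number of stored weights changes.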
Abstract
We introduce Elastic Looped Transformers (ELT), a highly parameter-efficient class of visual generative models based on a recurrent transformer architecture. While conventional generative models rely on deep stacks of unique transformer layers, our approach employs iterative, weight-shared transformer blocks to drastically reduce parameter counts while maintaining high synthesis quality. To effectively train these models for image and video generation, we propose the idea of Intra-Loop Self Distillation (ILSD), where student configurations (intermediate loops) are distilled from the teacher configuration (maximum training loops) to ensure consistency across the model's depth in a single training step. Our framework yields a family of elastic models from a single training run, enabling Any-Time inference capability with dynamic trade-offs between computational cost and generation…
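The looping and distillation ideas described in the abstract can be illustrated with a minimal pure-Python sketch. Everything here is an assumption for illustration: `shared_block` is a toy residual layer standing in for a transformer block, the names `looped_forward` and `ilsd_loss` are hypothetical, and the loss (task loss on the final loop plus a mean squared error pulling every intermediate loop toward the final one) is one plausible reading of ILSD rather than the authors' exact objective. In a real autodiff framework the teacher output would typically be treated as a constant (stop-gradient), which this sketch does not model.

```python
import math
import random


def shared_block(x, weights):
    """One application of the weight-shared block (toy residual tanh layer)."""
    return [
        xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
        for xi, row in zip(x, weights)
    ]


def looped_forward(x, weights, num_loops):
    """Apply the same block num_loops times; return the output after each loop."""
    outputs = []
    for _ in range(num_loops):
        x = shared_block(x, weights)  # same weights on every iteration
        outputs.append(x)
    return outputs


def mse(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def ilsd_loss(outputs, target, distill_weight=1.0):
    """Task loss on the final loop plus distillation of earlier loops toward it."""
    teacher = outputs[-1]    # maximum-loop configuration
    students = outputs[:-1]  # intermediate-loop configurations
    task = mse(teacher, target)
    distill = sum(mse(s, teacher) for s in students) / max(len(students), 1)
    return task + distill_weight * distill


random.seed(0)
dim = 4
weights = [[random.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
x = [random.uniform(-1.0, 1.0) for _ in range(dim)]
target = [0.0] * dim

outputs = looped_forward(x, weights, num_loops=4)
loss = ilsd_loss(outputs, target)

# Any-time inference: stop after fewer loops to spend less compute.
early_exit = looped_forward(x, weights, num_loops=2)[-1]
```

Because the intermediate outputs are trained to agree with the final one, stopping early (as in `early_exit`) is meant to yield a usable result at lower cost, which is the elastic behaviour the abstract describes.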
