Latent Flow Transformer
Yen-Chen Wu, Feng-Ting Liao, Meng-Hsi Chen, Pei-Chen Ho, Farhang Nabiei, Da-shan Shiu

TL;DR
The paper introduces the Latent Flow Transformer (LFT), which replaces a block of transformer layers with a single transport operator learned via flow matching or flow walking, compressing the model while staying compatible with the original architecture.
Contribution
It presents the Latent Flow Transformer, which replaces multiple layers with a learned transport operator and introduces flow walking to improve flow-based model compression.
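To make the idea of a learned transport operator concrete, here is a minimal sketch of generic flow matching between two latent vectors, using plain Python lists. It is not the authors' implementation: the straight-line interpolation path, the Euler integrator, and every function name below are assumptions chosen for illustration, and flow walking is not shown because this summary does not describe it. Here x0 stands for the hidden state entering the replaced block and x1 for the hidden state leaving it.

```python
# Illustrative sketch only: generic flow matching on a straight-line path.
# All names are hypothetical; the paper's exact formulation may differ.

def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def velocity_target(x0, x1):
    """Regression target for the velocity field on a straight-line path."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t."""
    xt = interpolate(x0, x1, t)
    pred = velocity_fn(xt, t)
    target = velocity_target(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def euler_transport(velocity_fn, x0, num_steps=1):
    """Move x0 toward the block output by integrating the velocity field."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy latents standing in for the block's input and output hidden states.
x0 = [0.0, 1.0, -2.0]
x1 = [1.0, 3.0, 0.5]

# An oracle field that already knows the right velocity for this one pair.
oracle = lambda x, t: velocity_target(x0, x1)

loss = flow_matching_loss(oracle, x0, x1, t=0.3)
moved = euler_transport(oracle, x0, num_steps=4)
```

In a real setting the oracle would be replaced by a trained network conditioned on the current latent and the time t, and the number of integration steps would trade compute against accuracy.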
Findings
On Pythia-410M, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers.
LFT with flow walking compresses 12 layers into one, reducing KL divergence.
The method narrows the gap between autoregressive and flow-based models.
Abstract
Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can lead to better performance, this approach has been challenged as far from efficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL…
Taxonomy
Topics: Data Stream Mining Techniques · Statistical and Computational Modeling
Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion
