Latent Flow Transformer

Yen-Chen Wu; Feng-Ting Liao; Meng-Hsi Chen; Pei-Chen Ho; Farhang Nabiei; Da-shan Shiu

arXiv:2505.14513·cs.LG·May 21, 2025

TL;DR

The paper introduces the Latent Flow Transformer (LFT), which compresses a transformer by replacing a block of layers with a single transport operator trained via flow matching or Flow Walking, while staying compatible with the original architecture.

Contribution

The paper presents the Latent Flow Transformer, which replaces a block of transformer layers with a single learned transport operator, and introduces the Flow Walking (FW) algorithm to address the limitations of existing flow-based methods in preserving coupling.

Findings

01. On Pythia-410M, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers.

02. LFT trained with Flow Walking compresses 12 layers into one, reducing KL divergence.

03. The method narrows the gap between autoregressive and flow-based models.
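
The findings above compare models by the KL divergence of their LM logits. A minimal sketch of such a metric follows, in plain Python. The function names (`log_softmax`, `kl_from_logits`, `mean_kl`) and the direction KL(original || compressed) are assumptions for illustration; this page does not state which direction or averaging the paper uses.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_from_logits(ref_logits, cmp_logits):
    """KL(P_ref || P_cmp) in nats for one token position.

    Assumption: ref_logits come from the original model and cmp_logits
    from the compressed model.
    """
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(cmp_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_batch, cmp_batch):
    """Average the per-position KL over a batch of token positions."""
    kls = [kl_from_logits(r, c) for r, c in zip(ref_batch, cmp_batch)]
    return sum(kls) / len(kls)
```

A lower value means the compressed model's next-token distribution stays closer to the original's; identical logits give a KL of zero.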

Abstract

Transformers, the standard implementation for large language models (LLMs), typically consist of tens to hundreds of discrete layers. While more layers can lead to better performance, this approach has been challenged as far from efficient, especially given the superiority of continuous layers demonstrated by diffusion and flow-based models for image generation. We propose the Latent Flow Transformer (LFT), which replaces a block of layers with a single learned transport operator trained via flow matching, offering significant compression while maintaining compatibility with the original architecture. Additionally, we address the limitations of existing flow-based methods in preserving coupling by introducing the Flow Walking (FW) algorithm. On the Pythia-410M model, LFT trained with flow matching compresses 6 of 24 layers and outperforms directly skipping 2 layers (KL…
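
To make the idea of a learned transport operator concrete, here is a minimal sketch of the standard flow-matching objective with a straight-line path between a block's input latent `x0` and its output latent `x1`. It is an illustration under stated assumptions, not the authors' implementation: `velocity_fn` is a hypothetical stand-in for the learned network, and the linear interpolation and Euler integration are common choices that the abstract does not confirm. Flow Walking is not sketched here because this page does not describe it.

```python
def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity at time t.

    velocity_fn(x_t, t) is a hypothetical callable standing in for the
    learned velocity network.
    """
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(target)


def euler_transport(velocity_fn, x0, num_steps=1):
    """Move x0 toward the block's output by Euler-integrating the velocity."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this sketch, training would draw `t` uniformly from [0, 1] for each (`x0`, `x1`) pair collected from the original model, and inference would call `euler_transport` in place of the replaced layers.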

Code & Models

Repositories

mtkresearch/latent-flow-transformer (Official)

Taxonomy

Topics: Data Stream Mining Techniques · Statistical and Computational Modeling

Methods: Attention Is All You Need · Linear Layer · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Multi-Head Attention · Dense Connections · Layer Normalization · Diffusion