LayerCache: Exploiting Layer-wise Velocity Heterogeneity for Efficient Flow Matching Inference
Guandong Li

TL;DR
LayerCache is a layer-wise caching strategy for Flow Matching models. It exploits the heterogeneous velocity dynamics of Transformer layer groups to speed up inference (a reported 1.37x) while improving image quality metrics relative to prior caching methods.
Contribution
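The paper proposes a layer-aware caching framework with adaptive scheduling. It partitions the Transformer into layer groups and makes an independent caching decision for each group at every denoising step, whereas prior methods apply a single caching decision to the whole network per timestep.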
Findings
Achieves 1.37x speedup with improved image quality metrics.
Reduces LPIPS by 70% compared to prior caching methods.
Outperforms all prior caching methods on the quality-speed Pareto frontier.
Abstract
Flow Matching models achieve state-of-the-art image generation quality but incur substantial inference cost due to iterative denoising through large Transformer networks. We observe that different layer groups within a Transformer exhibit markedly heterogeneous velocity dynamics: shallow layers are highly stable and amenable to aggressive caching, while deep layers undergo large velocity changes that demand full computation. Existing caching methods, however, treat the entire Transformer as a monolithic unit, applying a single caching decision per timestep and thus failing to exploit this heterogeneity. Based on this finding, we propose LayerCache, a layer-aware caching framework that partitions the Transformer into layer groups and makes independent, per-group caching decisions at each denoising step. LayerCache introduces an adaptive JVP span K selection mechanism that leverages…
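The per-group caching idea described in the abstract can be illustrated with a small toy sketch. This is not the paper's implementation: the excerpt above does not state the reuse criterion, so the relative-change test, the per-group thresholds, and every name here (`partition_layers`, `GroupCache`, `run_step`) are assumptions made for illustration only. The adaptive JVP span K selection mechanism is not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def partition_layers(layers: List[Layer], num_groups: int) -> List[List[Layer]]:
    """Split layers into contiguous groups of near-equal size."""
    size, extra = divmod(len(layers), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(layers[start:end])
        start = end
    return groups


def relative_change(new: Vector, old: Vector) -> float:
    """L1 relative change between two vectors (hypothetical reuse criterion)."""
    num = sum(abs(a - b) for a, b in zip(new, old))
    den = sum(abs(b) for b in old) + 1e-12
    return num / den


@dataclass
class GroupCache:
    threshold: float                        # assumed per-group tolerance
    last_input: Optional[Vector] = None     # input at the last full compute
    last_residual: Optional[Vector] = None  # group output minus group input


def run_step(
    groups: List[List[Layer]], caches: List[GroupCache], x: Vector
) -> Tuple[Vector, List[bool]]:
    """One denoising step; each group decides independently whether to reuse."""
    decisions = []
    for layers, cache in zip(groups, caches):
        reuse = (
            cache.last_input is not None
            and relative_change(x, cache.last_input) < cache.threshold
        )
        if reuse:
            # Skip the group's layers and re-apply its cached residual.
            x = [xi + ri for xi, ri in zip(x, cache.last_residual)]
        else:
            # Full computation; refresh this group's cache.
            h = x
            for layer in layers:
                h = layer(h)
            cache.last_input = x
            cache.last_residual = [hi - xi for hi, xi in zip(h, x)]
            x = h
        decisions.append(reuse)
    return x, decisions


if __name__ == "__main__":
    add_one: Layer = lambda v: [e + 1.0 for e in v]
    groups = partition_layers([add_one] * 4, num_groups=2)
    # Lenient threshold for the shallow group, zero tolerance for the deep one.
    caches = [GroupCache(threshold=0.5), GroupCache(threshold=0.0)]
    for step_input in ([1.0, 2.0], [1.1, 2.0]):
        out, decisions = run_step(groups, caches, step_input)
        print(out, decisions)
```

Giving the shallow group a lenient threshold and the deep group a strict one mirrors the abstract's observation that shallow layers are stable while deep layers change quickly. A monolithic cache would correspond to a single group with one shared decision.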
