FastCar: Cache Attentive Replay for Fast Auto-Regressive Video Generation on the Edge

Xuan Shen; Weize Ma; Yufa Zhou; Enhao Tang; Yanyue Xie; Zhengang Li; Yifan Gong; Quanyi Wang; Henghui Ding; Yiwei Wang; Yanzhi Wang; Pu Zhao; Jun Lin; Jiuxiang Gu

arXiv:2505.14709 · cs.CV · May 22, 2025

TL;DR

FastCar introduces a cache attentive replay method, paired with FPGA acceleration, that speeds up auto-regressive video generation on edge devices (a reported decoding speedup of over 2.1x) by exploiting temporal redundancy between adjacent frames.

Contribution

The paper proposes FastCar, a framework that combines temporal-redundancy-based replay of cached MLP outputs with a hardware accelerator to speed up AR video decoding, outperforming traditional sparse attention approaches in speed and energy efficiency.

Findings

01. Over 2.1x decoding speedup on edge devices
02. Higher energy efficiency compared to traditional sparse attention
03. Effective combination of FastCar with sparse attention for high-resolution videos

Abstract

Auto-regressive (AR) models, initially successful in language generation, have recently shown promise in visual generation tasks due to their superior sampling efficiency. Unlike image generation, video generation requires a substantially larger number of tokens to produce coherent temporal frames, resulting in significant overhead during the decoding phase. Our key observations are: (i) MLP modules in the decode phase dominate the inference latency, and (ii) there exists high temporal redundancy in MLP outputs of adjacent frames. In this paper, we propose the FastCar framework to accelerate the decode phase of AR video generation by exploiting this temporal redundancy. The Temporal Attention Score (TAS) is proposed to determine whether to apply the replay strategy (i.e., reusing cached MLP outputs from the previous frame to reduce redundant computations) with…
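The replay strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the `mlp` stand-in, the `decode_frame_with_replay` helper, the per-token score values, and the threshold rule are all hypothetical. In particular, this excerpt does not give the exact definition of TAS or the direction of the threshold test, so the decision rule is passed in as a parameter rather than fixed.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def mlp(x: Sequence[float]) -> Vector:
    """Hypothetical stand-in for an expensive MLP block (doubles each value)."""
    return [2.0 * v for v in x]


def decode_frame_with_replay(
    hidden: Sequence[Sequence[float]],
    scores: Sequence[float],
    cache: Sequence[Vector],
    should_replay: Callable[[float], bool],
) -> Tuple[List[Vector], int]:
    """Process one frame; return (MLP outputs, number of MLP calls skipped).

    `scores` holds one gating score per token position (a placeholder for
    TAS), and `cache` holds the previous frame's MLP outputs by position.
    """
    outputs: List[Vector] = []
    skipped = 0
    for pos, (x, score) in enumerate(zip(hidden, scores)):
        if pos < len(cache) and should_replay(score):
            # Replay: reuse the cached output from the previous frame.
            outputs.append(cache[pos])
            skipped += 1
        else:
            # No usable cache entry, or the gate says recompute.
            outputs.append(mlp(x))
    return outputs, skipped


# Assumed rule for illustration only: replay when the score is at least 0.5.
rule = lambda score: score >= 0.5

frame0 = [[1.0, 2.0], [3.0, 4.0]]
frame1 = [[1.0, 2.1], [9.0, 9.0]]

# First frame: the cache is empty, so every token is computed.
out0, skipped0 = decode_frame_with_replay(frame0, [0.0, 0.0], [], rule)

# Second frame: the made-up scores let position 0 replay, position 1 recompute.
out1, skipped1 = decode_frame_with_replay(frame1, [0.9, 0.1], out0, rule)
```

Here the first frame fills the cache, and in the second frame only the token whose score passes the assumed rule reuses its cached output. The outputs of each frame become the cache for the next one, so a replayed value can carry forward across several frames.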

Peer Reviews

Decision · ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

1. The paper is well-written. The proposed method is well-illustrated and easy to follow.
2. **The proposed method is simple and efficient:** Using the Temporal Attention Score (TAS), a metric that is effectively "free" as it is derived from pre-existing attention calculations, to guide the caching strategy is a very clever design choice. This avoids the overhead that often plagues other dynamic execution methods.
3. **Comprehensive empirical evaluation:** The paper presents a robust set of expe

Weaknesses

1. **Limited architectural generalization:** The experiments and analysis are conducted exclusively on the VILA-U model. While the authors correctly note the lack of other open-source AR video models, this raises a question about the generality of the core insight. The MLP-bottleneck observation may be specific to this particular architecture's configuration (e.g., hidden size, MLP expansion factor). The paper would be stronger if it included analysis on other AR models (even from image or langu

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. Strong empirical motivation: The paper provides compelling profiling evidence that MLPs, not attention, are the bottleneck in AR video decoding, which justifies shifting optimization focus away from KV-caching or sparse attention (common in LLMs) toward MLP replay.
2. Novel and well-motivated algorithmic component: The Temporal Attention Score (TAS) is a simple yet effective proxy for temporal similarity that incurs zero extra compute (as it reuses existing attention logits). The theoretical ana

Weaknesses

1. Threshold selection is manual: The replay threshold is tuned empirically. The paper shows robustness across thresholds (Fig. 4), but does not propose an adaptive or learned thresholding strategy, which could improve usability in dynamic real-world scenarios.
2. Limited evaluation and comparison: Evaluation is limited to VILA-U, and comparison is limited to StreamingLLM. Yes, VILA-U is the only open-source AR video generation model without diffusion, but this may indicate that the community and business a

Reviewer 03 · Rating 4 · Confidence 2

Strengths

The paper performs actual latency profiling and shows that, in autoregressive video decoding, the main bottleneck is the MLP/FFN block rather than attention. This is valuable because most prior acceleration work focuses on attention; here the motivation is concrete and data-driven. The core mechanism of FastCar — reusing the previous frame’s MLP outputs for tokens that barely change, instead of recomputing them every step — is straightforward and can be applied at inference time without retrain

Weaknesses

1. The core idea — detect tokens/regions that barely change over time and skip recomputing them by reusing cached features — is not fundamentally new. Similar per-token reuse / feature caching / dynamic skipping ideas have already appeared in video and diffusion generation acceleration. The paper mainly adapts this intuition to autoregressive video decoding, adds a specific gating signal (TAS), and wraps it with an FPGA story. This feels more like an incremental systemization than a genuinely ne

Code & Models

Repositories

shawnricecake/fast-car
PyTorch · Official

Taxonomy

Topics: Video Coding and Compression Technologies · Advanced Image and Video Retrieval Techniques · Image and Video Quality Assessment

Methods: Softmax · Attention Is All You Need