Parallel Loop Transformer for Efficient Test-Time Computation Scaling
Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin

TL;DR
The paper introduces the Parallel Loop Transformer (PLT), an architecture that computes different loops for different tokens in a single pass and shares the KV cache across loops, aiming to keep the accuracy of a looped transformer at the latency and memory cost of a standard, non-looped model.
Contribution
The paper proposes the Parallel Loop Transformer (PLT), a new architecture that combines Cross-Loop Parallelism (CLP) with a shared-memory (KV cache) strategy to improve the inference efficiency of looped transformers.
Findings
PLT achieves accuracy comparable to traditional looped models.
PLT keeps latency close to that of a standard, non-looped transformer.
PLT limits memory growth by sharing the KV cache across loops and using Gated Sliding-Window Attention (G-SWA).
Abstract
Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we…
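The scheduling idea behind Cross-Loop Parallelism, as the abstract describes it, can be illustrated with a small pass-counting sketch. It assumes a diagonal, pipeline-style schedule in which each pass batches a different loop for each of several neighboring tokens. The function names and the exact schedule are hypothetical, chosen for illustration only, and may differ from the paper's method. The sketch only counts forward passes; it does not model how PLT handles the dependency between one token's final loop and the next token's input.

```python
from typing import List, Tuple


def sequential_passes(num_tokens: int, num_loops: int) -> int:
    """Forward passes when every token runs its loops one after another."""
    return num_tokens * num_loops


def clp_schedule(num_tokens: int, num_loops: int) -> List[List[Tuple[int, int]]]:
    """Hypothetical diagonal schedule: at step s, token t runs loop (s - t).

    Each inner list holds the (token, loop) pairs batched into one pass.
    Tokens and loops are both zero-indexed.
    """
    steps = []
    for s in range(num_tokens + num_loops - 1):
        batch = [(t, s - t) for t in range(num_tokens) if 0 <= s - t < num_loops]
        steps.append(batch)
    return steps


if __name__ == "__main__":
    schedule = clp_schedule(num_tokens=4, num_loops=2)
    for step, batch in enumerate(schedule):
        print(step, batch)
    print("sequential passes:", sequential_passes(4, 2))
    print("parallel passes:", len(schedule))
```

Under this assumed schedule, 4 tokens with 2 loops take 5 passes instead of 8, and the pass count grows as tokens + loops - 1 rather than tokens × loops, so for long sequences each added loop costs about one extra pass in total instead of one extra pass per token.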
