Universal YOCO for Efficient Depth Scaling

Yutao Sun; Li Dong; Tianzhu Ye; Shaohan Huang; Jianyong Wang; Furu Wei

arXiv:2604.01220·cs.CL·April 2, 2026


TL;DR

YOCO-U combines YOCO architecture with recursive computation to enable efficient, scalable inference in large language models, improving token utility and performance with limited overhead.

Contribution

The paper introduces YOCO-U, an architecture that integrates YOCO with recursive computation for efficient depth scaling in LLMs.

Findings

01

YOCO-U achieves a favorable tradeoff between capability and efficiency.

02

YOCO-U maintains a constant global KV cache and linear pre-filling.

03

Empirical results show YOCO-U's competitiveness on benchmarks.

Abstract

The rise of test-time scaling has remarkably boosted the reasoning and agentic proficiency of Large Language Models (LLMs). Yet, standard Transformers struggle to scale inference-time compute efficiently, as conventional looping strategies suffer from high computational overhead and a KV cache that inflates alongside model depth. We present Universal YOCO (YOCO-U), which combines the YOCO decoder-decoder architecture with recursive computation to achieve a synergistic effect greater than either alone. Built on the YOCO framework, YOCO-U implements a Universal Self-Decoder that performs multiple iterations via parameter sharing, while confining the iterative process to shallow, efficient-attention layers. This combination yields a favorable capability-efficiency tradeoff that neither YOCO nor recursion achieves independently. The YOCO architecture provides a constant global KV cache and…
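The cache argument in the abstract can be illustrated with a rough back-of-the-envelope count. The sketch below is not the authors' accounting: it assumes that the efficient-attention layers of the self-decoder keep at most a fixed window of tokens each, and that a single global KV cache is shared by all cross-decoder layers. The function names and the `window` parameter are hypothetical and chosen for illustration only.

```python
def looped_transformer_kv_entries(seq_len, num_layers, num_loops):
    """Per-token KV entries for a standard Transformer whose layers are looped.

    Assumption: every layer in every loop iteration stores keys/values
    for all tokens, so the cache grows with effective depth.
    """
    return seq_len * num_layers * num_loops


def yoco_u_kv_entries(seq_len, self_layers, num_loops, window):
    """Per-token KV entries under the assumed YOCO-U style accounting.

    Assumption: each looped self-decoder layer keeps at most `window`
    tokens (efficient attention), and one global cache of `seq_len`
    entries is shared by all cross-decoder layers.
    """
    local = min(seq_len, window) * self_layers * num_loops
    global_cache = seq_len  # independent of the number of loops
    return local + global_cache


if __name__ == "__main__":
    for loops in (1, 3, 6):
        standard = looped_transformer_kv_entries(1000, 4, loops)
        yoco_u = yoco_u_kv_entries(1000, 2, loops, window=100)
        print(f"loops={loops}: looped Transformer={standard}, YOCO-U sketch={yoco_u}")
```

Under these assumptions, adding loop iterations increases only the small windowed term, while the global cache term stays fixed for a given sequence length. The looped standard Transformer's cache, by contrast, grows in proportion to the number of iterations.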
