Block-Based Double Decoders

Asher Labovich; Benjamin Bradley; Vanessa Alexander; Chaitanya Harsha

arXiv:2605.18807 · cs.LG · May 20, 2026

TL;DR

The paper introduces block-based double decoders, a transformer architecture that combines the training efficiency of decoder-only models with the inference efficiency of encoder-decoder models, cutting KV-cache memory and per-token compute by at least two-thirds at inference time.

Contribution

The paper proposes a doubly-causal block-based attention mechanism that lets transformer models train with full loss supervision and static sequence packing.
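To illustrate what static sequence packing means in general, here is a minimal sketch of the packing commonly used for decoder-only pretraining: tokenized documents are concatenated with a separator and cut into rows of one fixed length. The function name `pack_static`, the `eos_id` separator, and the toy token IDs are assumptions for illustration only; the paper's block-based packing may differ in its details.

```python
def pack_static(docs, seq_len, eos_id):
    """Concatenate tokenized documents and cut them into fixed-length rows.

    Generic illustration of static packing; not the paper's exact procedure.
    """
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)  # separator marking the end of each document

    # Drop the trailing remainder so every row has exactly seq_len tokens.
    n_rows = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_rows)]


# Hypothetical toy documents (token IDs are made up).
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
rows = pack_static(docs, seq_len=4, eos_id=0)
for row in rows:
    print(row)
```

Because every row has the same length, batch shapes stay fixed throughout training, in contrast to objectives whose input and target lengths vary per example.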

Findings

1. Outperforms encoder-decoders in scaling law experiments.

2. Reduces KV-cache memory and per-token compute by at least two-thirds.

3. Maintains prefill caching and inference optimizations of decoder-only models.

Abstract

Encoder-decoder models offer substantial inference-time savings over decoder-only models, but their pretraining objectives suffer from sparse supervision and dynamic sequence lengths, keeping them out of practice at scale. We propose block-based double decoders, a novel transformer architecture that utilizes doubly-causal block-based attention masks to train with full loss supervision and static sequence packing, combining decoder-only training efficiency with encoder-decoder inference efficiency. In scaling law experiments, block-based double decoders strongly outperform encoder-decoders and closely track decoder-only models across scales. At inference time, they cut KV-cache memory and per-token compute by at least 2/3 without sacrificing prefill caching or other existing inference optimizations available to decoder-only models.
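To give a sense of scale for the KV-cache claim, the following back-of-the-envelope sketch computes the cache size of a standard decoder-only transformer and the bound implied by a reduction of at least 2/3. The model dimensions and the helper name `kv_cache_bytes` are hypothetical and not taken from the paper; only the 2/3 figure comes from the abstract.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    Standard decoder-only accounting: each layer stores one key tensor and
    one value tensor of shape (num_kv_heads, seq_len, head_dim).
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem


# Hypothetical decoder-only configuration (assumed, not from the paper),
# with 16-bit cache entries (2 bytes per element).
baseline = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

# The abstract reports a cut of at least 2/3, so at most 1/3 would remain.
upper_bound = baseline / 3

gib = 2 ** 30
print(f"decoder-only baseline: {baseline / gib:.2f} GiB")
print(f"upper bound after a 2/3 cut: {upper_bound / gib:.2f} GiB")
```

The numbers are illustrative only; the actual savings depend on the model configuration and on how the architecture splits work between its two decoders.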
