TL;DR
Orthrus is a dual-architecture framework that pairs the exact generation fidelity of autoregressive LLMs with the parallel token generation of diffusion models, reporting lossless decoding with up to 7.8x speedup.
Contribution
It augments a frozen LLM with a lightweight, trainable module that adds a parallel diffusion view alongside the standard autoregressive view, yielding high-fidelity parallel token generation with minimal memory and parameter overhead.
Findings
Up to 7.8x speedup in token generation
Exact consensus guarantees lossless inference
Minimal memory and parameter overhead
Abstract
We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity…
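The abstract excerpt is cut off before it describes how the two views reach consensus, so the sketch below does not reproduce Orthrus's method. It only illustrates, under stated assumptions, the generic draft-and-verify idea that makes parallel proposals lossless: a block of proposed tokens is kept only up to the first position where it disagrees with the greedy autoregressive choice. All names here (`accept_block`, `decode`, `toy_ar_next`, `toy_draft`) are hypothetical, the toy "models" are arithmetic stand-ins, and verification is written as a sequential loop for clarity, whereas a real system would score the whole block in one forward pass.

```python
from typing import Callable, List, Sequence

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]        # autoregressive view (assumed greedy)
DraftFn = Callable[[Sequence[Token]], List[Token]]      # parallel proposer (hypothetical)


def accept_block(context: Sequence[Token], draft: List[Token], ar_next: NextTokenFn) -> List[Token]:
    """Keep drafted tokens while they match the AR choice; on the first
    mismatch, emit the AR token instead and stop."""
    if not draft:
        # No proposal: fall back to a single ordinary AR step so decoding progresses.
        return [ar_next(context)]
    accepted: List[Token] = []
    for proposed in draft:
        expected = ar_next(list(context) + accepted)
        accepted.append(expected)  # always the AR token, so the output is never changed
        if proposed != expected:
            break
    return accepted


def decode(prompt: Sequence[Token], draft_block: DraftFn, ar_next: NextTokenFn, max_new: int) -> List[Token]:
    """Draft-and-verify decoding; the result matches plain greedy AR decoding."""
    out = list(prompt)
    while len(out) - len(prompt) < max_new:
        out.extend(accept_block(out, draft_block(out), ar_next))
    return out[: len(prompt) + max_new]


def greedy(prompt: Sequence[Token], ar_next: NextTokenFn, max_new: int) -> List[Token]:
    """Baseline: one token per step."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(ar_next(out))
    return out


# Toy stand-ins (assumptions for illustration only).
def toy_ar_next(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 11


def toy_draft(ctx: Sequence[Token], block: int = 4) -> List[Token]:
    """Proposes `block` tokens at once, deliberately wrong at some positions."""
    proposals, t = [], ctx[-1]
    for i in range(block):
        t = (t * 3 + 1) % 11
        if (len(ctx) + i) % 5 == 0:
            t = (t + 1) % 11  # injected error to exercise rejection
        proposals.append(t)
    return proposals


if __name__ == "__main__":
    print(decode([2], toy_draft, toy_ar_next, 12) == greedy([2], toy_ar_next, 12))
```

In this sketch the drafter's quality affects only how many tokens are accepted per step, not which tokens are emitted, which is the sense in which such a scheme can be called lossless.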
