PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation

Yuma Ichikawa; Naoya Takagi; Takumi Nakagawa; Yuzi Kanazawa; Akira Sakai

arXiv:2512.20687 · cs.LG · January 9, 2026

TL;DR

PHOTON is a hierarchical autoregressive model that improves language generation speed and memory efficiency by replacing horizontal token-by-token scanning with vertical, multi-resolution context scanning, with the largest benefit on long-context tasks.

Contribution

The paper presents PHOTON, a hierarchical autoregressive model that reduces inference latency and memory usage through vertical context scanning and recursive generation, which updates only the coarsest latent stream. It reports better throughput per unit memory than standard Transformers.

Findings

01. Up to 1000x higher throughput per unit memory.

02. Superior long-context and multi-query task performance.

03. Reduces decode-time KV-cache traffic.

Abstract

Transformers operate as horizontal token-by-token scanners: at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding more memory-bound, because KV-cache reads and writes, rather than arithmetic operations, dominate inference time. We propose Parallel Hierarchical Operation for TOp-down Networks (PHOTON), a hierarchical autoregressive model that replaces horizontal scanning with vertical, multi-resolution context scanning. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations in parallel. We further introduce recursive generation that updates only the coarsest latent stream and eliminates bottom-up re-encoding. Experimental results show that PHOTON is…
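
As a rough illustration of why a hierarchy of low-rate latent streams shrinks the state that has to be stored and re-read, the sketch below counts the states in each stream when every level compresses its input by a fixed factor. The function name `stream_lengths`, the chunk size of 4, and the two-level depth are assumptions made for this example; they are not taken from the paper, and the sketch does not model PHOTON's encoder or decoders.

```python
import math


def stream_lengths(num_tokens: int, chunk: int, levels: int) -> list[int]:
    """Return the number of states per stream, finest (tokens) first.

    Hypothetical setup: each level emits one state per `chunk` states
    of the level below it, rounding up for a partial final chunk.
    """
    lengths = [num_tokens]
    for _ in range(levels):
        lengths.append(math.ceil(lengths[-1] / chunk))
    return lengths


# Illustrative values only: 4096 tokens, chunk size 4, two coarse levels.
print(stream_lengths(4096, 4, 2))  # [4096, 1024, 256]
```

Under these assumed values the coarsest stream holds 16 times fewer states than the token-level sequence, which is the kind of reduction that would cut cache traffic if decoding attends mainly to the coarse stream.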

Taxonomy

Topics: Natural Language Processing Techniques · Advanced Neural Network Applications · Caching and Content Delivery