PHOTON: Hierarchical Autoregressive Modeling for Lightspeed and Memory-Efficient Language Generation
Yuma Ichikawa, Naoya Takagi, Takumi Nakagawa, Yuzi Kanazawa, Akira Sakai

TL;DR
PHOTON is a hierarchical autoregressive model that improves language generation speed and memory efficiency by replacing horizontal token-by-token scanning with vertical, multi-resolution context scanning. The benefit is largest on long-context tasks.
Contribution
The paper presents PHOTON, a hierarchical autoregressive model that uses vertical context scanning and recursive generation to reduce inference latency and memory usage relative to conventional Transformers.
Findings
Up to 1000x higher throughput per unit memory.
Superior long-context and multi-query task performance.
Reduces decode-time KV-cache traffic.
Abstract
Transformers operate as horizontal token-by-token scanners: at each generation step, the model attends to an ever-growing sequence of token-level states. This access pattern increases prefill latency and makes long-context decoding increasingly memory-bound, because KV-cache reads and writes, rather than arithmetic operations, dominate inference time. We propose Parallel Hierarchical Operation for TOp-down Networks (PHOTON), a hierarchical autoregressive model that replaces horizontal scanning with vertical, multi-resolution context scanning. PHOTON maintains a hierarchy of latent streams: a bottom-up encoder compresses tokens into low-rate contextual states, while lightweight top-down decoders reconstruct fine-grained token representations in parallel. We further introduce recursive generation that updates only the coarsest latent stream and eliminates bottom-up re-encoding. Experimental results show that PHOTON is…
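The memory-bound argument in the abstract can be made concrete with back-of-the-envelope arithmetic. The sketch below is not PHOTON's architecture. It only counts how many cached states a single decode step would attend over if every `chunk_size` tokens were summarized by one low-rate state. The function names, the single-level compression, and the model dimensions are all hypothetical choices made for illustration.

```python
def kv_entries_read(context_len: int, chunk_size: int = 1) -> int:
    """Cached states one decode step attends over, per layer.

    Assumption (not from the paper): each group of `chunk_size` tokens is
    represented by a single latent state. chunk_size=1 is ordinary
    token-level attention.
    """
    if context_len < 0 or chunk_size < 1:
        raise ValueError("context_len must be >= 0 and chunk_size >= 1")
    return -(-context_len // chunk_size)  # ceiling division


def kv_cache_bytes(
    context_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # e.g. 16-bit values
    chunk_size: int = 1,
) -> int:
    """Approximate bytes of keys and values touched per decode step."""
    states = kv_entries_read(context_len, chunk_size)
    # Factor 2 accounts for storing both a key and a value per state.
    return 2 * states * n_layers * n_kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, chosen only to make the numbers readable.
    shape = dict(n_layers=32, n_kv_heads=8, head_dim=128)
    token_level = kv_cache_bytes(32768, **shape)
    coarse = kv_cache_bytes(32768, chunk_size=16, **shape)
    print(token_level // 2**20, "MiB at token level")   # 4096 MiB
    print(coarse // 2**20, "MiB with chunk_size=16")    # 256 MiB
```

Under these assumptions the bytes read per step shrink in proportion to the compression rate. A real hierarchical model also pays for its local decoders and any intermediate levels, which this count ignores.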
Taxonomy
Topics: Natural Language Processing Techniques · Advanced Neural Network Applications · Caching and Content Delivery
