PRISM: Breaking the O(n) Memory Wall in Long-Context LLM Inference via O(1) Photonic Block Selection
Hyoseok Park, Yeonsang Park

TL;DR
PRISM uses photonic hardware to perform KV-cache block selection in O(1), relieving the memory-bandwidth bottleneck in long-context large language model inference and yielding significant energy savings.
Contribution
This work is the first to apply the photonic broadcast-and-weight paradigm to coarse block selection in long-context LLM inference, breaking the O(n) memory wall.
Findings
Achieves 100% accuracy from 4K to 64K tokens at k=32.
Reduces memory traffic by 16x at 64K context length.
Provides a four-order-of-magnitude energy advantage over GPU baselines.
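The traffic figure can be related to context length with simple arithmetic. The sketch below is a back-of-the-envelope model, not the paper's accounting: it assumes that k counts selected KV blocks, that only the selected blocks are fetched from memory, and that every block holds `block_size` tokens. The function name and `block_size=128` are hypothetical; this listing does not state the block size.

```python
def kv_fetch_reduction(n_tokens, k_blocks, block_size):
    """Ratio of tokens read by a full KV scan to tokens read after block selection.

    Assumes only the k selected blocks are fetched (hypothetical model).
    """
    fetched = min(k_blocks * block_size, n_tokens)  # cannot fetch more than exists
    return n_tokens / fetched


# block_size=128 is an illustrative assumption, not a value from the paper.
for n in (4096, 16384, 65536):
    print(n, kv_fetch_reduction(n, k_blocks=32, block_size=128))
```

Under this model the fetched volume is fixed at `k_blocks * block_size` tokens, so the reduction grows linearly with context length once the context exceeds that budget. With the assumed block size the 64K case evaluates to 16, although that agreement does not establish the paper's actual configuration.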
Abstract
Long-context LLM inference is bottlenecked not by compute but by the O(n) memory bandwidth cost of scanning the KV cache at every decode step -- a wall that no amount of arithmetic scaling can break. Recent photonic accelerators have demonstrated impressive throughput for dense attention computation; however, these approaches inherit the same O(n) memory scaling as electronic attention when applied to long contexts. We observe that the real leverage point is the coarse block-selection step: a memory-bound similarity search that determines which KV blocks to fetch. We identify, for the first time, that this task is structurally matched to the photonic broadcast-and-weight paradigm -- the query fans out to all candidates via passive splitting, signatures are quasi-static (matching electro-optic MRR programming), and only rank order matters (relaxing precision to 4-6 bits). Crucially, the…
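The coarse block-selection step described in the abstract can be sketched electronically as a similarity search followed by a top-k pick. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the per-block signature is taken to be the mean key vector, the similarity is a dot product, and low-precision photonic weights are emulated with uniform quantization. All function names are hypothetical.

```python
import numpy as np


def block_signatures(keys, block_size):
    """One signature per block: the mean key vector (an assumed choice)."""
    n, d = keys.shape
    n_blocks = n // block_size
    trimmed = keys[: n_blocks * block_size]
    return trimmed.reshape(n_blocks, block_size, d).mean(axis=1)


def quantize(x, bits):
    """Uniform symmetric quantization, emulating low-precision stored weights."""
    levels = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / levels
    if scale == 0:
        return x.copy()
    return np.round(x / scale) * scale


def select_blocks(query, signatures, k):
    """Indices of the k highest-scoring blocks, best first. Only rank order is used."""
    scores = signatures @ query
    k = min(k, len(scores))
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]


rng = np.random.default_rng(0)
keys = rng.standard_normal((4096, 64))   # synthetic KV-cache keys
query = rng.standard_normal(64)

sigs = block_signatures(keys, block_size=128)
exact = select_blocks(query, sigs, k=8)
approx = select_blocks(query, quantize(sigs, bits=5), k=8)

# Fraction of blocks that the 5-bit selection shares with the full-precision one.
overlap = len(set(exact.tolist()) & set(approx.tolist())) / len(exact)
print("overlap with full precision:", overlap)
```

Because the selection depends only on which scores rank highest, moderate quantization error in the signatures matters only when it reorders blocks near the top-k boundary. Comparing `exact` and `approx` on real key data would be one way to probe the 4-6 bit range mentioned in the abstract.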
Taxonomy
Topics: Neural Networks and Reservoir Computing · Optical Network Technologies · Photonic and Optical Devices
