Optical Transformers
Maxwell G. Anderson, Shi-Yuan Ma, Tianyu Wang, Logan G. Wright, Peter L. McMahon

TL;DR
This paper explores the potential of optical hardware to perform large Transformer model computations more energy-efficiently than digital systems, demonstrating promising experimental results and theoretical scaling laws.
Contribution
The study provides the first small-scale optical experiments for Transformer operations and develops scaling laws predicting significant energy efficiency gains with large-scale optical hardware.
Findings
Optical implementations can run Transformer operations despite noise and errors.
Energy per MAC scales as 1/d, offering asymptotic advantages over digital systems.
Large-scale optical Transformers could potentially be over 8,000x more energy-efficient than digital systems.
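The 1/d scaling in the findings above can be illustrated with a minimal sketch. It assumes that d is the length of the vectors being multiplied (for example, the model width) and that each length-d dot product costs a roughly fixed optical energy regardless of d. The function names and the constants are hypothetical and chosen only for illustration; they are not values from the paper.

```python
# Illustrative sketch only: names and constants are hypothetical, not from the paper.

def optical_energy_per_mac(d: int, energy_per_dot_product: float) -> float:
    """Energy per MAC when one length-d dot product costs a fixed optical energy.

    Assumption: the optical energy per dot-product output does not grow with d,
    so the cost is shared across the d multiply-accumulates in that dot product.
    """
    if d <= 0:
        raise ValueError("d must be positive")
    return energy_per_dot_product / d


def advantage_over_digital(d: int, energy_per_dot_product: float,
                           digital_energy_per_mac: float) -> float:
    """Ratio of a fixed digital energy per MAC to the optical energy per MAC."""
    return digital_energy_per_mac / optical_energy_per_mac(d, energy_per_dot_product)


if __name__ == "__main__":
    E_DOT = 1.0      # hypothetical optical energy per dot product (arbitrary units)
    E_DIGITAL = 1.0  # hypothetical digital energy per MAC (same arbitrary units)
    for d in (64, 512, 4096):
        print(d,
              optical_energy_per_mac(d, E_DOT),
              advantage_over_digital(d, E_DOT, E_DIGITAL))
```

Under these assumptions, doubling d halves the optical energy per MAC while the digital cost per MAC stays fixed, so the modeled advantage grows linearly with d.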
Abstract
The rapidly increasing size of deep-learning models has caused renewed and growing interest in alternatives to digital computers to dramatically reduce the energy cost of running state-of-the-art neural networks. Optical matrix-vector multipliers are best suited to performing computations with very large operands, which suggests that large Transformer models could be a good target for optical computing. To test this idea, we performed small-scale optical experiments with a prototype accelerator to demonstrate that Transformer operations can run on optical hardware despite noise and errors. Using simulations, validated by our experiments, we then explored the energy efficiency of optical implementations of Transformers and identified scaling laws for model performance with respect to optical energy usage. We found that the optical energy per multiply-accumulate (MAC) scales as…
Taxonomy
Topics: Neural Networks and Reservoir Computing · Photonic and Optical Devices · Optical Network Technologies
Methods: Multi-Head Attention · Attention Is All You Need · Test · Linear Layer · Label Smoothing · Dense Connections · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Dropout
