AI Accelerators for Large Language Model Inference: Architecture Analysis and Scaling Strategies

Amit Sharma

arXiv:2506.00008 · cs.AR · June 10, 2025

Open Access

TL;DR

This paper compares commercial AI accelerators (GPU-based chips, hybrid packages, and wafer-scale engines) on large language model inference, highlighting the architectural differences and scaling strategies that determine performance.

Contribution

It presents the first workload-centric, cross-architectural performance comparison of commercial AI accelerators for LLM inference, offering insights into scaling techniques and architectural gaps.

Findings

01

Performance varies by up to 3.7x across architectures as batch size and sequence length change.

02

Expert parallelism achieves an 8.4x parameter-to-compute advantage but incurs 2.1x higher latency variance than tensor parallelism.

03

The results give quantitative guidance for matching workloads to suitable accelerators and identify architectural gaps for next-generation designs to address.

Abstract

The rapid growth of large language models (LLMs) is driving a new wave of specialized hardware for inference. This paper presents the first workload-centric, cross-architectural performance study of commercial AI accelerators, spanning GPU-based chips, hybrid packages, and wafer-scale engines. We compare memory hierarchies, compute fabrics, and on-chip interconnects, and observe up to 3.7x performance variation across architectures as batch size and sequence length change. Four scaling techniques for trillion-parameter models are examined; expert parallelism offers an 8.4x parameter-to-compute advantage but incurs 2.1x higher latency variance than tensor parallelism. These findings provide quantitative guidance for matching workloads to accelerators and reveal architectural gaps that next-generation designs must address.
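
As a rough illustration of what a "parameter-to-compute advantage" can mean for expert parallelism, the sketch below computes the ratio of stored parameters to parameters active per token in a mixture-of-experts layer with top-k routing. The function name, the cost model, and the configuration values are assumptions made for illustration; they are not taken from the paper and do not reproduce how its 8.4x figure was measured.

```python
def moe_param_to_compute_ratio(num_experts: int, top_k: int,
                               expert_params: float,
                               shared_params: float) -> float:
    """Ratio of stored parameters to parameters used per token.

    Assumes (hypothetically) that per-token compute scales with the
    number of active parameters: the shared (non-expert) weights plus
    the top_k experts each token is routed to.
    """
    if not 0 < top_k <= num_experts:
        raise ValueError("top_k must be between 1 and num_experts")
    total = shared_params + num_experts * expert_params   # held in memory
    active = shared_params + top_k * expert_params        # touched per token
    return total / active


# Hypothetical configuration: 16 experts of 1B parameters each,
# 2 experts active per token, 0.5B shared parameters.
ratio = moe_param_to_compute_ratio(num_experts=16, top_k=2,
                                   expert_params=1.0e9,
                                   shared_params=0.5e9)
print(f"parameter-to-compute ratio: {ratio:.2f}x")
```

Under this simplified model the ratio grows with the number of experts and shrinks as more experts are activated per token. A dense model (one expert, always active) has a ratio of 1.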

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Machine Learning in Materials Science · Advanced Neural Network Applications