Architecture-Aware LLM Inference Optimization on AMD Instinct GPUs: A Comprehensive Benchmark and Deployment Study

Athos Georgiou

arXiv:2603.10031·cs.AR·March 12, 2026

TL;DR

This paper evaluates and optimizes large language model inference on AMD Instinct GPUs, highlighting architecture-aware techniques, runtime requirements, and performance benchmarks across diverse models and workloads.

Contribution

It provides a comprehensive benchmark and deployment analysis of LLM inference on AMD GPUs, emphasizing architecture-specific optimizations and runtime configurations.

Findings

01 Architecture-aware optimization is essential for performance.

02 MLA models require block size 1 and cannot use KV cache offloading; GQA models benefit from both.

03 Inference throughput is limited by memory bandwidth at high concurrency.

Abstract

We present a cross-architecture evaluation of production LLM inference on AMD Instinct MI325X GPUs, benchmarking four models spanning 235B to 1 trillion parameters across three architectural families (MoE+MLA, Dense+GQA, MoE+GQA) on an 8-GPU cluster with 2TB aggregate HBM3e using vLLM v0.14.1. Our results demonstrate that architecture-aware optimization is essential: MLA models require block size 1 and cannot use KV cache offloading, while GQA models benefit from both. The AMD AITER runtime is required for competitive MLA inference throughput and must be selectively disabled for architectures with incompatible attention head configurations. A controlled AITER ablation on Llama-3.1-405B (n=5 per condition) reveals a modest 3-5% throughput benefit at high concurrency but 2-16x higher measurement variability, confirming that AITER's large speedups target MoE/MLA kernels specifically. Under…
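
The architecture-dependent rules stated in the abstract can be sketched as a small lookup. This is an illustrative sketch only, not the paper's tooling: `InferencePlan`, `plan_for`, and the `aiter_compatible_heads` flag are hypothetical names, and the mapping to actual vLLM options is left out because the abstract does not give it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InferencePlan:
    """Hypothetical container for the settings the abstract discusses."""
    block_size: Optional[int]   # None = not pinned; tune per workload
    kv_cache_offload: bool      # whether KV cache offloading may be enabled
    use_aiter: bool             # whether the AMD AITER runtime is enabled


def plan_for(attention: str, aiter_compatible_heads: bool = True) -> InferencePlan:
    """Pick settings from the attention family ("MLA" or "GQA").

    Assumption: `aiter_compatible_heads` stands in for the abstract's
    "incompatible attention head configurations"; how that is detected
    is not specified in the source.
    """
    family = attention.upper()
    if family == "MLA":
        # Abstract: MLA requires block size 1 and cannot offload KV cache.
        return InferencePlan(block_size=1,
                             kv_cache_offload=False,
                             use_aiter=aiter_compatible_heads)
    if family == "GQA":
        # Abstract: GQA benefits from both, so nothing is pinned here.
        return InferencePlan(block_size=None,
                             kv_cache_offload=True,
                             use_aiter=aiter_compatible_heads)
    raise ValueError(f"unknown attention family: {attention!r}")


if __name__ == "__main__":
    for family in ("MLA", "GQA"):
        print(family, plan_for(family))
```

Keeping the rules in one function makes the per-architecture differences explicit and lets AITER be disabled for a single model without touching the others.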

Taxonomy

Topics: Parallel Computing and Optimization Techniques · Advanced Neural Network Applications · Big Data and Digital Economy