From LLM to Silicon: RL-Driven ASIC Architecture Exploration for On-Device AI Inference
Ravindra Ganti, Steve Xu

TL;DR
This paper introduces an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference from 3nm to 28nm, adapting the design to each process node without node-specific manual retuning.
Contribution
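It formulates the ASIC design space as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective, so that mesh topology, per-core microarchitecture, and operator placement are optimized jointly rather than tuned by hand for each process node.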
Findings
Reaches 29809 tokens/sec on Llama 3.1 8B FP16 at 3nm in high-performance mode.
Keeps SmolVLM below 13 mW at every node in low-power mode, running at 10 MHz.
Adapts mesh sizes and per-tile configurations (heterogeneous FETCH, VLEN, and memory allocation) across 7 process nodes without node-specific manual retuning.
Abstract
We present an RL-driven compiler that jointly optimizes ASIC architecture, memory hierarchy, and workload partitioning for AI inference across 3nm to 28nm. The design space is formulated as a single Markov Decision Process with mixed discrete-continuous actions and a unified Power-Performance-Area (PPA) objective. Soft Actor-Critic (SAC) with Mixture-of-Experts gating explores the joint space of mesh topology, per-core microarchitecture, and operator placement. We validate on two workloads, Llama 3.1 8B FP16 (high-performance mode, 29809 tokens per second at 3nm) and SmolVLM (low-power mode, less than 13 mW at all nodes, 10 MHz). Across 7 process nodes, the RL automatically adapts mesh sizes and per-tile configurations, including heterogeneous FETCH, VLEN, and memory allocation without node-specific manual retuning.
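To make the "mixed discrete-continuous actions" and "unified PPA objective" concrete, here is a minimal sketch of how one design point and one scalar reward could be encoded. Everything in it is an assumption made for illustration: the abstract does not specify the action encoding, the value ranges, or the reward formula, so `DesignAction`, `MESH_CHOICES`, `VLEN_CHOICES`, `decode_action`, and `ppa_reward` are hypothetical names, and the log-scaled weighted sum is one common way to scalarize PPA, not necessarily the paper's.

```python
import math
from dataclasses import dataclass

# Hypothetical option lists; the paper's actual ranges are not given in the abstract.
MESH_CHOICES = (1, 2, 4, 8, 16)
VLEN_CHOICES = (128, 256, 512, 1024)


@dataclass(frozen=True)
class DesignAction:
    mesh_rows: int    # discrete: mesh topology
    mesh_cols: int    # discrete: mesh topology
    vlen: int         # discrete: per-core vector length
    sram_frac: float  # continuous in [0, 1]: share of a budget given to memory


def _clip(x):
    """Clamp a raw policy output to [-1, 1]."""
    return min(max(x, -1.0), 1.0)


def _pick(x, choices):
    """Bin a value in [-1, 1] into one of the discrete choices."""
    idx = int((_clip(x) + 1.0) / 2.0 * len(choices))
    return choices[min(idx, len(choices) - 1)]


def decode_action(raw):
    """Map a 4-element raw action in [-1, 1]^4 to a valid mixed action."""
    return DesignAction(
        mesh_rows=_pick(raw[0], MESH_CHOICES),
        mesh_cols=_pick(raw[1], MESH_CHOICES),
        vlen=_pick(raw[2], VLEN_CHOICES),
        sram_frac=(_clip(raw[3]) + 1.0) / 2.0,
    )


def ppa_reward(tokens_per_s, power_mw, area_mm2,
               w_perf=1.0, w_power=1.0, w_area=1.0):
    """Assumed scalar PPA score: reward throughput, penalize power and area."""
    return (w_perf * math.log(tokens_per_s)
            - w_power * math.log(power_mw)
            - w_area * math.log(area_mm2))
```

Under this sketch, a high-performance mode and a low-power mode would differ only in the weights: raising `w_perf` favors throughput, while raising `w_power` pushes the search toward low-power design points. A continuous-action learner such as SAC could emit the raw vector, with `decode_action` turning it into a legal configuration before evaluation.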
