GPU-Accelerated Optimization of Transformer-Based Neural Networks for Real-Time Inference

Soutrik Mukherjee; Sangwhan Cha

arXiv:2603.28708 · cs.LG · March 31, 2026

TL;DR

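This paper develops a GPU-accelerated, hybrid-precision inference pipeline, built on NVIDIA TensorRT, for transformer models such as BERT-base and GPT-2. It reports up to 64.4x speedup over CPU baselines, a 63 percent reduction in memory usage, and cosine similarity >= 0.9998 relative to baseline outputs, targeting real-time applications.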

Contribution

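It introduces a hybrid FP16/FP32 precision strategy that keeps numerically sensitive operations such as softmax and layer normalization in FP32 and runs linear layers in FP16, which maintains numerical fidelity and eliminates NaN instability in GPU-accelerated transformer inference.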
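
The abstract describes the strategy as FP32 for softmax and layer normalization and FP16 for linear layers. A minimal NumPy sketch of that split is shown here for illustration only. It is not the authors' TensorRT pipeline: the function names, the single attention-style block, the random weights, and the choice to run the attention matrix products in the linear-layer precision are all assumptions made for this example.

```python
import numpy as np

def softmax_fp32(x):
    # Numerically sensitive op: always computed in FP32.
    x = x.astype(np.float32)
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm_fp32(x, eps=1e-5):
    # Numerically sensitive op: always computed in FP32.
    x = x.astype(np.float32)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def linear(x, w, dtype):
    # Matrix product carried out in the requested precision.
    return x.astype(dtype) @ w.astype(dtype)

def block(x, wq, wk, wv, linear_dtype):
    # Hypothetical attention-style block (assumption, not from the paper).
    q = linear(x, wq, linear_dtype)
    k = linear(x, wk, linear_dtype)
    v = linear(x, wv, linear_dtype)
    scores = linear(q, k.T, linear_dtype) / (q.shape[-1] ** 0.5)
    probs = softmax_fp32(scores)               # FP32 island
    out = linear(probs, v, linear_dtype)
    residual = x.astype(np.float32) + out.astype(np.float32)
    return layer_norm_fp32(residual)           # FP32 island

def cosine_similarity(a, b):
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
d = 64  # hidden size chosen for the example only
x = rng.standard_normal((16, d)).astype(np.float32)
wq, wk, wv = (rng.standard_normal((d, d)).astype(np.float32) / np.sqrt(d)
              for _ in range(3))

baseline = block(x, wq, wk, wv, np.float32)   # all-FP32 reference
hybrid = block(x, wq, wk, wv, np.float16)     # FP16 linears, FP32 softmax/LN
fidelity = cosine_similarity(baseline, hybrid)
has_nan = bool(np.isnan(hybrid).any())
print(f"cosine similarity: {fidelity:.6f}, NaN present: {has_nan}")
```

Comparing `hybrid` against the all-FP32 `baseline` mirrors the fidelity metric quoted in the abstract, although the value printed here depends on the toy inputs and says nothing about the paper's models.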

Findings

01 Up to 64.4x speedup over CPU baselines.

02 Sub-10 ms latency for single-sample inference.

03 No accuracy loss with hybrid precision on downstream tasks.

Abstract

This paper presents the design and evaluation of a GPU-accelerated inference pipeline for transformer models using NVIDIA TensorRT with mixed-precision optimization. We evaluate BERT-base (110M parameters) and GPT-2 (124M parameters) across batch sizes from 1 to 32 and sequence lengths from 32 to 512. The system achieves up to 64.4x speedup over CPU baselines, sub-10 ms latency for single-sample inference, and a 63 percent reduction in memory usage. We introduce a hybrid precision strategy that preserves FP32 for numerically sensitive operations such as softmax and layer normalization, while applying FP16 to linear layers. This approach maintains high numerical fidelity (cosine similarity >= 0.9998 relative to baseline outputs) and eliminates NaN instability. The pipeline is implemented as a modular, containerized system that enables reproducible benchmarking across more than 360…
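
The evaluation sweeps batch sizes from 1 to 32 and sequence lengths from 32 to 512 and reports latency and speedup. A rough sketch of such a sweep might look like the following. The harness, its function names, the placeholder workload, and the specific grid points are hypothetical: the abstract gives only the endpoints of each range, so the powers of two used here are an assumption.

```python
import statistics
import time

def median_latency_ms(fn, warmup=2, repeats=5):
    # Warm-up runs are discarded so one-time setup cost is not measured.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def sweep(make_workload, batch_sizes, seq_lengths):
    # Returns {(batch_size, seq_length): median latency in ms}.
    results = {}
    for b in batch_sizes:
        for s in seq_lengths:
            results[(b, s)] = median_latency_ms(make_workload(b, s))
    return results

def speedup(baseline_ms, optimized_ms):
    # Speedup is the ratio of baseline latency to optimized latency.
    return baseline_ms / optimized_ms

def make_dummy_workload(batch_size, seq_length):
    # Placeholder standing in for a model forward pass (assumption).
    n = batch_size * seq_length
    return lambda: sum(range(n))

# Assumed grid: only the endpoints (1..32, 32..512) come from the abstract.
BATCH_SIZES = (1, 2, 4, 8, 16, 32)
SEQ_LENGTHS = (32, 64, 128, 256, 512)

latencies = sweep(make_dummy_workload, BATCH_SIZES, SEQ_LENGTHS)
for (b, s), ms in sorted(latencies.items()):
    print(f"batch={b:2d} seq={s:3d} median={ms:.4f} ms")
```

When timing real GPU work, the device has to be synchronized before each timer read, because kernel launches are asynchronous and a host-side timer would otherwise measure only launch overhead.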
