DEEP-GAP: Deep-learning Evaluation of Execution Parallelism in GPU Architectural Performance

Kathiravan Palaniappan

arXiv:2604.14552 · cs.PF · May 7, 2026

TL;DR

DEEP-GAP systematically evaluates GPU inference performance across T4 and L4 architectures, revealing how precision modes and batch sizes impact throughput and efficiency for modern AI workloads.

Contribution

This work extends the GDEV-AI methodology to GPU inference, providing empirical data comparing T4 and L4 GPUs under controlled conditions.

Findings

01. INT8 precision achieves up to 58x throughput over the CPU baseline.

02. The L4 GPU outperforms the T4 by up to 4.4x in throughput.

03. Smaller batch sizes favor the L4 for latency-sensitive tasks.

Abstract

Modern datacenters increasingly rely on low-power, single-slot inference accelerators to balance performance, energy efficiency, and rack density constraints. The NVIDIA T4 GPU has become widely deployed due to strong performance per watt and mature software support. Its successor, the NVIDIA L4 GPU, introduces improvements in Tensor Core throughput, cache capacity, memory bandwidth, and parallel execution capability. However, limited empirical evidence quantifies the practical inference performance gap between these two generations under controlled and reproducible conditions. This work introduces DEEP-GAP, a systematic evaluation extending the GDEV-AI methodology to GPU inference. Using identical configurations and workloads, we evaluate ResNet18, ResNet50, and ResNet101 across FP32, FP16, and INT8 precision modes using PyTorch and TensorRT. Results show that reduced precision…
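
The kind of controlled throughput comparison the abstract describes can be illustrated with a small timing harness. The sketch below is not the paper's benchmark code. The function names (`measure_throughput`, `speedup`, `fake_inference`), the warm-up and iteration counts, and the placeholder CPU workload are all assumptions made for illustration. In a real run, `run_batch` would wrap a PyTorch or TensorRT inference call for one model and precision mode.

```python
import time
from statistics import median
from typing import Callable, Dict


def measure_throughput(run_batch: Callable[[int], None], batch_size: int,
                       warmup: int = 3, iters: int = 10) -> float:
    """Median throughput (samples/second) of run_batch at one batch size."""
    for _ in range(warmup):
        run_batch(batch_size)  # warm-up calls are not timed
    rates = []
    for _ in range(iters):
        start = time.perf_counter()
        run_batch(batch_size)
        # For real GPU work, synchronize the device here (for example with
        # torch.cuda.synchronize()) so the clock covers the whole batch.
        elapsed = max(time.perf_counter() - start, 1e-9)
        rates.append(batch_size / elapsed)
    return median(rates)


def speedup(throughputs: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Throughput of each configuration relative to the baseline entry."""
    base = throughputs[baseline]
    return {name: value / base for name, value in throughputs.items()}


def fake_inference(batch_size: int) -> None:
    """Hypothetical placeholder: CPU arithmetic standing in for a model call."""
    sum(i * i for i in range(1000 * batch_size))


if __name__ == "__main__":
    results = {f"batch={b}": measure_throughput(fake_inference, b)
               for b in (1, 8, 32)}
    for name, ratio in speedup(results, "batch=1").items():
        print(f"{name}: {results[name]:.0f} samples/s, {ratio:.2f}x vs batch=1")
```

Reporting the median over several timed iterations, after untimed warm-up runs, reduces the influence of one-off costs such as lazy initialization. Ratios like the "up to 4.4x" figure in the findings are the kind of value `speedup` would produce when given measured throughputs for two configurations.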
