Are Large Language Models Economically Viable for Industry Deployment?
Abdullah Mohammad, Sushant Kumar Ray, Pushkar Arora, Rafiq Ali, Ebad Shabbir, Gautam Siddharth Kashyap, Jiechao Gao, Usman Naseem

TL;DR
This paper introduces EDGE-EVAL, a benchmarking framework evaluating LLMs on operational and economic metrics for industry deployment, revealing efficiency frontiers and challenges in quantization.
Contribution
It presents a comprehensive industry-oriented benchmarking framework with new deployment metrics, addressing the gap in economic evaluation of LLMs in real-world settings.
Findings
Models under 2B parameters outperform larger ones economically and ecologically.
LLaMA-3.2-1B (INT4) reaches ROI break-even after a median of 14 requests.
Quantization-aware training may increase adaptation energy, challenging assumptions.
Abstract
Generative AI, powered by Large Language Models (LLMs), is increasingly deployed in industry across healthcare decision support, financial analytics, enterprise retrieval, and conversational automation, where reliability, efficiency, and cost control are critical. In such settings, models must satisfy strict constraints on energy, latency, and hardware utilization, not accuracy alone. Yet prevailing evaluation pipelines remain accuracy-centric, creating a Deployment-Evaluation Gap: the absence of operational and economic criteria in model assessment. To address this gap, we present EDGE-EVAL, an industry-oriented benchmarking framework that evaluates LLMs across their full lifecycle on legacy NVIDIA Tesla T4 GPUs. Benchmarking LLaMA and Qwen variants across three industrial tasks, we introduce five deployment metrics: Economic Break-Even (N_break), Intelligence-Per-Watt (IPW), System Density…
