Towards Optimizing the Costs of LLM Usage
Shivanshu Shekhar, Tanishq Dubey, Koyel Mukherjee, Apoorv Saxena, Atharv Tyagi, Nishanth Kotla

TL;DR
This paper presents a cost-optimization framework for LLM usage that predicts output quality without invocation, enabling cost-effective selection and token reduction strategies, validated on diverse datasets.
Contribution
It introduces a novel quality prediction model and optimization algorithms for LLM selection and token reduction, improving cost-efficiency while maintaining quality.
Findings
Cost reduction of 40%-90% achieved
Quality improvement of 4%-7% demonstrated
Effective on enterprise and open-source datasets
Abstract
Generative AI and LLMs in particular are heavily used nowadays for various document processing tasks such as question answering and summarization. However, different LLMs come with different capabilities for different tasks as well as with different costs, tokenization, and latency. In fact, enterprises are already incurring huge costs of operating or using LLMs for their respective use cases. In this work, we propose optimizing the usage costs of LLMs by estimating their output quality (without actually invoking the LLMs), and then solving an optimization routine for the LLM selection to either keep costs under a budget, or minimize the costs, in a quality and latency aware manner. We propose a model to predict the output quality of LLMs on document processing tasks like summarization, followed by an LP rounding algorithm to optimize the selection of LLMs. We study optimization…
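To make the budgeted selection problem concrete, here is a minimal sketch in Python. It is not the paper's LP rounding algorithm, which the truncated abstract does not spell out; it substitutes a simple greedy heuristic. The function name `select_llms`, the per-document quality and cost tables, and the model names are all hypothetical, and the quality scores are assumed to come from some predictor that runs before any LLM is invoked.

```python
from typing import Dict, List


def select_llms(
    pred_quality: List[Dict[str, float]],
    cost: List[Dict[str, float]],
    budget: float,
) -> List[str]:
    """Pick one model per document under a total cost budget.

    Greedy heuristic (an assumption, not the paper's method): start every
    document on its cheapest model, then repeatedly apply the affordable
    upgrade with the best predicted-quality gain per extra unit of cost.
    """
    n_docs = len(pred_quality)
    # Start from the cheapest model per document; break cost ties by quality.
    choice = [
        min(cost[d], key=lambda m: (cost[d][m], -pred_quality[d][m]))
        for d in range(n_docs)
    ]
    spent = sum(cost[d][choice[d]] for d in range(n_docs))
    if spent > budget:
        raise ValueError("budget cannot cover the cheapest model for every document")

    while True:
        best = None  # (gain per extra cost, document index, model name)
        for d in range(n_docs):
            cur = choice[d]
            for m in cost[d]:
                extra = cost[d][m] - cost[d][cur]
                gain = pred_quality[d][m] - pred_quality[d][cur]
                # Only consider strict upgrades that still fit the budget.
                if extra > 0 and gain > 0 and spent + extra <= budget:
                    ratio = gain / extra
                    if best is None or ratio > best[0]:
                        best = (ratio, d, m)
        if best is None:
            return choice  # no affordable improving upgrade remains
        _, d, m = best
        spent += cost[d][m] - cost[d][choice[d]]
        choice[d] = m


# Hypothetical predicted quality and cost for two documents and two models.
quality = [{"small": 0.6, "large": 0.9}, {"small": 0.8, "large": 0.85}]
costs = [{"small": 1.0, "large": 5.0}, {"small": 1.0, "large": 5.0}]
print(select_llms(quality, costs, budget=6.0))
```

With a budget of 6.0, the sketch spends the upgrade on the first document, where the larger model is predicted to help most, and keeps the second document on the cheaper model. A greedy pass like this gives no optimality guarantee and ignores latency, which the abstract says the actual formulation accounts for.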
Taxonomy
Topics: Scheduling and Optimization Algorithms · Manufacturing Process and Optimization
Methods: Attentive Walk-Aggregating Graph Neural Network
