GAR: Carbon-Aware Routing for LLM Inference via Constrained Optimization
Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta

TL;DR
This paper introduces Green-Aware Routing (GAR), a multi-objective optimization framework for LLM inference that reduces CO2 emissions while maintaining accuracy and latency constraints.
Contribution
GAR formulates per-request LLM routing as a constrained optimization problem and makes carbon-aware routing decisions in real time, balancing sustainability against accuracy and latency requirements.
Findings
GAR reduces CO2 emissions significantly across NLP benchmarks.
GAR maintains accuracy and latency within specified service-level objectives.
The online primal-dual algorithm effectively manages rolling carbon budgets.
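As a rough illustration of how an online primal-dual scheme can hold emissions to a rolling budget, the sketch below uses generic dual ascent; it is not the paper's algorithm. The class name `DualCarbonController`, the scoring rule (estimated quality minus a carbon price), the step size, and the window length are all assumptions introduced for this example.

```python
from collections import deque


class DualCarbonController:
    """Hypothetical dual-ascent controller for a rolling per-request carbon budget."""

    def __init__(self, budget_g_per_request, step_size=0.05, window=100):
        self.budget = budget_g_per_request  # target average gCO2 per request (assumed unit)
        self.step = step_size               # dual step size (assumed value)
        self.lam = 0.0                      # dual variable: current "price" of carbon
        self.recent = deque(maxlen=window)  # emissions of the most recent requests

    def choose(self, options):
        # Primal step: each option is (name, est_quality, est_co2_g).
        # Pick the option with the best quality after charging lam per gram of CO2.
        return max(options, key=lambda o: o[1] - self.lam * o[2])

    def update(self, emitted_g):
        # Dual step: raise the price when the rolling average exceeds the budget,
        # lower it (never below zero) when there is slack.
        self.recent.append(emitted_g)
        rolling_avg = sum(self.recent) / len(self.recent)
        self.lam = max(0.0, self.lam + self.step * (rolling_avg - self.budget))


# Illustrative numbers only, not results from the paper.
options = [("small", 0.70, 1.0), ("large", 0.90, 5.0)]
controller = DualCarbonController(budget_g_per_request=2.0)
first_choice = controller.choose(options)   # lam is 0, so quality alone decides
controller.update(first_choice[2])          # over budget, so the carbon price rises
second_choice = controller.choose(options)
```

With a zero price the controller favors the higher-quality option; once the rolling average overshoots the budget, the price grows and lower-emission options start to win.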
Abstract
The growing deployment of large language models (LLMs) makes per-request routing essential for balancing response quality and computational cost across heterogeneous model pools. Current routing methods rarely consider sustainable energy use and CO2 emissions as optimization objectives, despite grid carbon intensity varying by time and region, and models differing significantly in energy consumption. To address this gap, we introduce Green-Aware Routing (GAR), a constrained multi-objective optimization framework that minimizes per-request CO2 emissions subject to explicit accuracy floors and p95-latency service-level objectives (SLOs). GAR employs adaptive constraint optimization through per-dataset floor tuning and incorporates lightweight estimators for correctness, tail latency, and carbon emissions, enabling real-time routing decisions without additional inference passes. We present…
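The routing objective described in the abstract (minimize per-request CO2 subject to an accuracy floor and a p95-latency SLO) can be sketched as a filter-then-minimize step over a model pool. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` fields, the `route` function, the energy-times-grid-intensity carbon estimate, and the fallback to the most accurate model when no candidate is feasible are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str             # hypothetical model identifier
    p_correct: float      # estimated probability of a correct answer
    p95_latency_s: float  # estimated p95 latency in seconds
    energy_kwh: float     # estimated energy per request in kWh


def route(candidates, grid_gco2_per_kwh, accuracy_floor, latency_slo_s):
    """Return the lowest-emission candidate that meets both constraints."""

    def co2_g(c):
        # Assumed estimator: energy per request times current grid carbon intensity.
        return c.energy_kwh * grid_gco2_per_kwh

    feasible = [
        c for c in candidates
        if c.p_correct >= accuracy_floor and c.p95_latency_s <= latency_slo_s
    ]
    if feasible:
        return min(feasible, key=co2_g)
    # Assumed fallback when nothing is feasible: prefer accuracy over emissions.
    return max(candidates, key=lambda c: c.p_correct)


# Illustrative pool with made-up estimates.
pool = [
    Candidate("small", 0.80, 0.4, 0.0002),
    Candidate("medium", 0.90, 0.9, 0.0010),
    Candidate("large", 0.95, 2.5, 0.0040),
]
chosen = route(pool, grid_gco2_per_kwh=400, accuracy_floor=0.85, latency_slo_s=1.0)
```

Because the estimates are computed before any model is called, a rule of this shape adds no extra inference passes; the grid-intensity argument is where time- and region-dependent carbon data would enter.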
