Context-Aware CodeLLM Eviction for AI-assisted Coding
Kishanthan Thangarajah, Boyuan Chen, Shi Chang, Ahmed E. Hassan

TL;DR
This paper introduces CACE, a context-aware eviction strategy for self-hosted CodeLLMs. Tailored to AI-assisted coding workflows, it reduces latency and improves resource efficiency by weighing multiple factors beyond recency.
Contribution
The paper proposes CACE, a novel multi-factor eviction algorithm that enhances model management for self-hosted CodeLLMs under resource constraints, outperforming traditional methods.
Findings
CACE reduces latency and model eviction frequency.
Weighing multiple factors, rather than recency alone, lets eviction balance responsiveness against resource efficiency.
In experiments, CACE outperforms state-of-the-art systems.
Abstract
AI-assisted coding tools powered by Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows. To address concerns around privacy, latency, and model customization, many enterprises opt to self-host these models. However, the diversity and growing number of CodeLLMs, coupled with limited accelerator memory, introduce practical challenges in model management and serving efficiency. This paper presents CACE, a novel context-aware model eviction strategy designed specifically to optimize self-hosted CodeLLM serving under resource constraints. Unlike traditional eviction strategies based solely on recency (e.g., Least Recently Used), CACE leverages multiple context-aware factors, including model load time, task-specific latency sensitivity, expected output length, and recent usage and future demand tracked through a sliding window. We…
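The contrast the abstract draws between recency-only eviction and multi-factor eviction can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the scoring formula, the weights, the 60-second window, and all names (`LoadedModel`, `retention_score`, `pick_victim`, and so on) are hypothetical, chosen only to show how load time, latency sensitivity, expected output length, and sliding-window usage might be combined into one score. In particular, the assumption that models with short expected outputs are more worth keeping (because a reload would dominate their response time) is ours, and future demand is approximated here by recent usage alone.

```python
from collections import deque
from dataclasses import dataclass, field

WINDOW_S = 60.0  # sliding-window length in seconds (assumed value)


@dataclass
class LoadedModel:
    name: str
    load_time_s: float           # cost of reloading the model after eviction
    latency_sensitivity: float   # 0..1, higher = serves latency-critical tasks
    expected_output_tokens: int  # typical response length for the model's task
    last_used: float = 0.0
    recent_requests: deque = field(default_factory=deque)  # request timestamps


def record_request(model: LoadedModel, now: float) -> None:
    """Register one request against the model at time `now`."""
    model.last_used = now
    model.recent_requests.append(now)


def window_count(model: LoadedModel, now: float) -> int:
    """Count requests inside the sliding window, dropping expired ones."""
    while model.recent_requests and model.recent_requests[0] < now - WINDOW_S:
        model.recent_requests.popleft()
    return len(model.recent_requests)


def retention_score(model: LoadedModel, now: float,
                    weights=(0.1, 1.0, 1.0, 0.5)) -> float:
    """Higher score = more worth keeping in accelerator memory.

    The linear form and the default weights are illustrative assumptions.
    """
    w_load, w_sens, w_out, w_use = weights
    short_output_bonus = 1.0 / (1.0 + model.expected_output_tokens / 100.0)
    return (w_load * model.load_time_s
            + w_sens * model.latency_sensitivity
            + w_out * short_output_bonus
            + w_use * window_count(model, now))


def pick_victim(models, now: float) -> LoadedModel:
    """Multi-factor choice: evict the model with the lowest retention score."""
    return min(models, key=lambda m: retention_score(m, now))


def lru_victim(models) -> LoadedModel:
    """Recency-only baseline: evict the least recently used model."""
    return min(models, key=lambda m: m.last_used)


def demo_models():
    """Three hypothetical models with different usage patterns."""
    completion = LoadedModel("completion", 20.0, 1.0, 30)
    chat = LoadedModel("chat", 40.0, 0.3, 500)
    summarizer = LoadedModel("summarizer", 5.0, 0.1, 800)
    for t in (50, 55, 60, 65, 70):
        record_request(completion, t)
    record_request(chat, 80)
    record_request(summarizer, 95)
    return [completion, chat, summarizer]


if __name__ == "__main__":
    models = demo_models()
    print("LRU evicts:", lru_victim(models).name)                    # completion
    print("Multi-factor evicts:", pick_victim(models, now=100.0).name)  # summarizer
```

In this toy scenario, LRU evicts the code-completion model because it was touched least recently, even though it is latency-critical and heavily used within the window. The multi-factor score instead evicts the summarizer, which is cheap to reload and tolerant of delay.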
