Context-Aware CodeLLM Eviction for AI-assisted Coding

Kishanthan Thangarajah; Boyuan Chen; Shi Chang; Ahmed E. Hassan

arXiv:2506.18796·cs.SE·June 24, 2025

TL;DR

This paper introduces CACE, a context-aware eviction strategy for self-hosted CodeLLMs in AI-assisted coding workflows. By weighing multiple factors beyond recency, it reduces latency and improves resource efficiency.

Contribution

The paper proposes CACE, a novel multi-factor eviction algorithm that enhances model management for self-hosted CodeLLMs under resource constraints, outperforming traditional methods.

Findings

01

CACE reduces latency and model eviction frequency.

02

Multi-factor eviction balances responsiveness and efficiency.

03

In the reported experiments, CACE outperforms state-of-the-art systems.

Abstract

AI-assisted coding tools powered by Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows. To address concerns around privacy, latency, and model customization, many enterprises opt to self-host these models. However, the diversity and growing number of CodeLLMs, coupled with limited accelerator memory, introduce practical challenges in model management and serving efficiency. This paper presents CACE, a novel context-aware model eviction strategy designed specifically to optimize self-hosted CodeLLM serving under resource constraints. Unlike traditional eviction strategies based solely on recency (e.g., Least Recently Used), CACE leverages multiple context-aware factors, including model load time, task-specific latency sensitivity, expected output length, and recent usage and future demand tracked through a sliding window. We…
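
To make the contrast with recency-only eviction concrete, here is a minimal sketch of a multi-factor eviction decision built from the factors the abstract lists. It is not the paper's algorithm: the linear scoring form, the `WEIGHTS` values, the `ModelState` fields, the direction in which expected output length affects the score, and the use of the in-window request count as a stand-in for both recent usage and future demand are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelState:
    """Hypothetical per-model bookkeeping; field names are illustrative."""
    name: str
    load_time_s: float            # cost of reloading the model after eviction
    latency_sensitivity: float    # 0..1, higher means a more interactive task
    expected_output_tokens: int   # typical response length for the task
    request_times: list = field(default_factory=list)  # request timestamps (s)


# Assumed weights, chosen only so the example is easy to follow.
WEIGHTS = {"load": 1.0, "sensitivity": 10.0, "usage": 2.0, "output": 0.01}


def keep_score(model, now, window_s=60.0, weights=WEIGHTS):
    """Higher score means the model is more worth keeping in memory."""
    # Sliding window: count only requests from the last `window_s` seconds.
    recent = sum(1 for t in model.request_times if 0 <= now - t <= window_s)
    return (
        weights["load"] * model.load_time_s
        + weights["sensitivity"] * model.latency_sensitivity
        + weights["usage"] * recent
        # Assumption: long outputs amortize a reload, so they lower the score.
        - weights["output"] * model.expected_output_tokens
    )


def pick_victim(models, now, window_s=60.0):
    """Evict the loaded model with the lowest keep score."""
    return min(models, key=lambda m: keep_score(m, now, window_s))


def lru_victim(models):
    """Recency-only baseline: evict the least recently used model."""
    return min(models, key=lambda m: max(m.request_times, default=float("-inf")))


models = [
    ModelState("completion", 8.0, 1.0, 32, [90.0, 95.0, 99.0]),
    ModelState("chat", 20.0, 0.3, 512, [10.0, 80.0]),
    ModelState("docgen", 5.0, 0.1, 1024, [98.0]),
]

print(pick_victim(models, now=100.0).name)
print(lru_victim(models).name)
```

With these made-up numbers the two policies disagree: the recency-only baseline evicts `chat` because it was used least recently, while the multi-factor sketch evicts `docgen`, which is cheap to reload and serves a task that tolerates latency. The example is meant only to show why extra factors can change the decision, not to reproduce the paper's results.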
