Balancing Latency and Accuracy of Code Completion via Local-Cloud Model Cascading

Hanzhen Lu; Lishui Fan; Jiachi Chen; Qiuyuan Chen; Zhao Wei; Zhongxin Liu

arXiv:2603.05974 · cs.SE · March 10, 2026

TL;DR

This paper introduces MCCom, a framework that cascades a small local model with a cloud-based large language model for code completion, balancing latency and accuracy by selectively invoking the cloud model based on user actions.

Contribution

The paper presents a cascading framework that uses user actions to trigger the cloud model, a two-stage speculative decoding strategy, an iterative retrieval mechanism, and a 121M-parameter lightweight local model, improving both the efficiency and the accuracy of code completion.

Findings

01. Reduces inference latency by up to 47.9%.
02. Cuts cloud LLM usage by 46.3%.
03. Improves LLM exact match rate by 8.9%.

Abstract

Line-level code completion requires a critical balance between high accuracy and low latency. Existing methods suffer from a trade-off: large language models (LLMs) provide high-quality suggestions but incur high latency, while small language models (SLMs) are fast but often suboptimal. We propose MCCom (Model-Cascading-based code Completion), a framework that cascades a local SLM with a cloud-based LLM. To achieve effective cascading, MCCom leverages user actions as a novel signal to trigger the LLM only when the SLM fails, significantly reducing cloud computation costs. Furthermore, we introduce a two-stage speculative decoding strategy and an iterative retrieval mechanism to enhance collaboration between the models. We also train a 121M-parameter lightweight model, which achieves 73.8% of the performance of a 7B state-of-the-art model. Evaluated on RepoEval and a new real-world…
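The cascading idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `cascade_complete`, `CascadeResult`, and the `user_accepts` callback are hypothetical, and the abstract does not say which user actions MCCom observes, so the callback simply stands in for "the user accepted the local suggestion". The sketch also leaves out the two-stage speculative decoding and the iterative retrieval mechanism.

```python
from dataclasses import dataclass
from typing import Callable

# A completer maps the code before the cursor to a suggested line.
Completer = Callable[[str], str]


@dataclass
class CascadeResult:
    suggestion: str
    source: str  # "local" or "cloud"


def cascade_complete(
    prefix: str,
    local_model: Completer,
    cloud_model: Completer,
    user_accepts: Callable[[str], bool],
) -> CascadeResult:
    """Try the local model first; call the cloud model only on rejection.

    `user_accepts` is a hypothetical stand-in for the user-action signal:
    it returns True when the user takes the local suggestion.
    """
    local_suggestion = local_model(prefix)
    if user_accepts(local_suggestion):
        # The cheap local answer was good enough; no cloud call is made.
        return CascadeResult(local_suggestion, "local")
    # The local model failed for this request, so escalate to the cloud.
    return CascadeResult(cloud_model(prefix), "cloud")


# Toy models for illustration only (assumptions, not real SLM/LLM calls).
def toy_local(prefix: str) -> str:
    return "return a + b" if "def add" in prefix else "pass"


def toy_cloud(prefix: str) -> str:
    return "return a * b"


if __name__ == "__main__":
    accept_non_trivial = lambda s: s != "pass"
    for prefix in ("def add(a, b):\n    ", "def mul(a, b):\n    "):
        result = cascade_complete(prefix, toy_local, toy_cloud, accept_non_trivial)
        print(result.source, "->", result.suggestion)
```

In this sketch the cloud model is called only for requests where the local suggestion is rejected, which is the mechanism the abstract credits for reducing cloud computation.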

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Advanced Neural Network Applications