Federated Inference for Heterogeneous LLM Communication and Collaboration

Zihan Chen; Zeshen Li; Howard H. Yang; Tony Q.S. Quek; Jihong Park

arXiv:2603.28772 · cs.DC · April 1, 2026

TL;DR

This paper introduces FedRefine, a federated inference framework enabling heterogeneous LLMs to collaborate efficiently and privately through KV cache communication, enhancing on-device inference performance.

Contribution

The paper proposes a novel federated inference paradigm for heterogeneous LLMs that preserves privacy and improves collaborative inference efficiency.

Findings

01

Numerical results demonstrate the superiority of FedRefine.

02

The paper highlights several topics for future research.

03

A new paradigm for LLM-native communication is explored.

Abstract

Given the limited performance and efficiency of on-device Large Language Models (LLMs), collaboration among multiple LLMs enables desirable performance enhancements, in which data, tokens, and model weights could be shared across LLMs. This process is constrained by task-oriented QoS demands, privacy requirements, and inherent system heterogeneity. In view of these challenges, and to fully exploit on-device inference capabilities, we present in this position paper a novel federated inference framework, termed federated refinement (FedRefine). This framework presents a new paradigm in which heterogeneous LLMs collaboratively perform inference by communicating KV caches in a privacy-preserving manner. Some numerical results are provided to highlight the superiority of FedRefine. Several interesting topics are also highlighted for future research. By exploring…
