Federated Inference for Heterogeneous LLM Communication and Collaboration
Zihan Chen, Zeshen Li, Howard H. Yang, Tony Q.S. Quek, Jihong Park

TL;DR
This paper introduces FedRefine, a federated inference framework enabling heterogeneous LLMs to collaborate efficiently and privately through KV cache communication, enhancing on-device inference performance.
Contribution
The paper proposes a novel federated inference paradigm for heterogeneous LLMs that preserves privacy and improves collaborative inference efficiency.
Findings
Numerical results demonstrate the superiority of FedRefine.
Several topics for future research are highlighted.
A new paradigm for LLM-native communication is explored.
Abstract
Given the limited performance and efficiency of on-device Large Language Models (LLMs), collaboration among multiple LLMs enables desirable performance enhancements, in which data, tokens, and model weights could be shared across LLMs. This process is constrained by task-oriented QoS demands, privacy requirements, and inherent system heterogeneity. In view of these challenges, and to fully exploit on-device inference capabilities, we present in this position paper a novel federated inference framework, termed federated refinement (FedRefine). This framework introduces a new paradigm in which heterogeneous LLMs collaboratively perform inference by communicating KV caches in a privacy-preserving manner. Numerical results are provided to highlight the superiority of FedRefine. Several interesting topics are also highlighted for future research. By exploring…
