Floe: Federated Specialization for Real-Time LLM-SLM Inference
Chunlin Tian, Kahou Tam, Yebo Wu, Shuaihang Zhong, Li Li, Nicholas D. Lane, Chengzhong Xu

TL;DR
Floe is a federated learning framework that combines cloud-based LLMs with lightweight edge models to enable real-time, privacy-preserving inference in resource-constrained environments, reducing latency and improving personalization.
Contribution
Floe introduces a hybrid federated approach with heterogeneity-aware adaptation and logit fusion for efficient, privacy-preserving LLM inference on edge devices.
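To make the idea of logit fusion concrete, here is a minimal sketch of one common way to combine next-token scores from two models: a convex combination of their logits followed by a softmax. This is an illustration only, not the paper's actual fusion rule. The function names, the `alpha` weight, and the assumption that both models share one vocabulary and expose full logit vectors are all hypothetical (a black-box cloud LLM may expose only partial scores).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_logits(slm_logits, llm_logits, alpha=0.5):
    """Blend edge (SLM) and cloud (LLM) logits token by token.

    alpha is a hypothetical mixing weight: 1.0 trusts only the edge model,
    0.0 trusts only the cloud model. Both lists must index the same vocabulary.
    """
    if len(slm_logits) != len(llm_logits):
        raise ValueError("logit vectors must cover the same vocabulary")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(slm_logits, llm_logits)]


def next_token(slm_logits, llm_logits, alpha=0.5):
    """Greedy decoding step: index of the most probable token after fusion."""
    probs = softmax(fuse_logits(slm_logits, llm_logits, alpha))
    return max(range(len(probs)), key=probs.__getitem__)
```

With `alpha=1.0` the step degenerates to edge-only decoding, which is a natural fallback when the cloud model is slow or unreachable; whether Floe behaves this way is not stated in this summary.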
Findings
Reduces inference latency significantly compared to baselines.
Enhances user privacy and personalization.
Improves model performance on edge devices.
Abstract
Deploying large language models (LLMs) in real-time systems remains challenging due to their substantial computational demands and privacy concerns. We propose Floe, a hybrid federated learning framework designed for latency-sensitive, resource-constrained environments. Floe combines a cloud-based black-box LLM with lightweight small language models (SLMs) on edge devices to enable low-latency, privacy-preserving inference. Personal data and fine-tuning remain on-device, while the cloud LLM contributes general knowledge without exposing proprietary weights. A heterogeneity-aware LoRA adaptation strategy enables efficient edge deployment across diverse hardware, and a logit-level fusion mechanism enables real-time coordination between edge and cloud models. Extensive experiments demonstrate that Floe enhances user privacy and personalization. Moreover, it significantly improves model…
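The abstract does not say how the heterogeneity-aware LoRA strategy adapts to each device. As a rough sketch under stated assumptions, one simple policy is to give each device the largest adapter rank that fits its memory budget. A LoRA adapter of rank r on a d_in x d_out weight matrix adds r * (d_in + d_out) parameters; everything else below (the candidate ranks, the byte budget, the function names) is hypothetical and not taken from the paper.

```python
def lora_param_count(d_in, d_out, rank, num_matrices):
    """Trainable parameters added by rank-`rank` LoRA on `num_matrices` layers.

    Each adapted matrix gets two low-rank factors: (d_in x rank) and (rank x d_out).
    """
    return rank * (d_in + d_out) * num_matrices


def pick_rank(budget_bytes, d_in, d_out, num_matrices,
              candidates=(4, 8, 16, 32, 64), bytes_per_param=2):
    """Return the largest candidate rank whose adapter fits in budget_bytes.

    Returns None when even the smallest rank does not fit.
    bytes_per_param=2 assumes 16-bit adapter weights (an assumption).
    """
    best = None
    for rank in sorted(candidates):
        size = lora_param_count(d_in, d_out, rank, num_matrices) * bytes_per_param
        if size <= budget_bytes:
            best = rank
    return best
```

For example, with ten 1024 x 1024 matrices, rank 8 needs 327,680 bytes at 16-bit precision and rank 16 needs 655,360 bytes, so a device with a 500,000-byte adapter budget would be assigned rank 8 under this policy.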
Taxonomy
Topics: Advanced Neural Network Applications · Big Data and Digital Economy · Machine Learning in Healthcare
