Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings
Liangqi Yuan, Dong-Jun Han, Shiqiang Wang, Christopher G. Brinton

TL;DR
This paper introduces TMO, a system that intelligently offloads LLM inference between local and cloud resources for multi-modal, multi-task, multi-dialogue applications, optimizing for response quality, latency, and cost.
Contribution
The paper proposes TMO, a novel local-cloud inference system with a reinforcement learning strategy and a new dataset for evaluating multi-modal LLM offloading.
Findings
TMO significantly reduces latency and cost compared to baselines.
TMO improves response quality in multi-modal, multi-task, multi-dialogue settings.
The RCRL strategy effectively balances resource use and performance.
Abstract
Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for…
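The offloading decision described in the abstract can be illustrated with a toy example. The sketch below is not the paper's RCRL strategy: it replaces the learned policy with a simple greedy rule that scores each option by estimated quality, latency, and cost, and skips options that exceed a remaining budget. The option names, the numeric estimates, the weights, and the `choose_option` helper are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    """One place to run inference, with hypothetical per-query estimates."""
    name: str
    quality: float  # estimated response quality in [0, 1]
    latency: float  # estimated seconds per response
    cost: float     # estimated usage cost per query


def choose_option(options, remaining_budget,
                  w_quality=1.0, w_latency=0.1, w_cost=0.5):
    """Pick the affordable option with the highest weighted score.

    This is a greedy stand-in for a learned policy, not RCRL itself.
    """
    # Resource constraint: drop options that do not fit the budget.
    affordable = [o for o in options if o.cost <= remaining_budget]
    if not affordable:
        raise ValueError("no option fits the remaining budget")

    def score(o):
        # Reward quality, penalize latency and usage cost.
        return w_quality * o.quality - w_latency * o.latency - w_cost * o.cost

    return max(affordable, key=score)


# Hypothetical estimates: a fast local model, and a cloud model queried
# with text only or with text plus an image.
LOCAL = Option("local", quality=0.6, latency=0.5, cost=0.0)
CLOUD_TEXT = Option("cloud_text", quality=0.85, latency=2.0, cost=0.1)
CLOUD_IMAGE = Option("cloud_text_image", quality=0.9, latency=3.0, cost=0.3)
OPTIONS = [LOCAL, CLOUD_TEXT, CLOUD_IMAGE]

if __name__ == "__main__":
    print(choose_option(OPTIONS, remaining_budget=1.0).name)
    print(choose_option(OPTIONS, remaining_budget=0.05).name)
```

With these made-up numbers, a generous budget favors the text-only cloud option, while a nearly exhausted budget falls back to the local model. A learned policy would instead adapt the choice to the task, the dialogue history, and the available modalities.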
Taxonomy
Topics: Service-Oriented Architecture and Web Services · IoT and Edge/Fog Computing · Robotics and Automated Systems
