Local-Cloud Inference Offloading for LLMs in Multi-Modal, Multi-Task, Multi-Dialogue Settings

Liangqi Yuan; Dong-Jun Han; Shiqiang Wang; Christopher G. Brinton

arXiv:2502.11007 · cs.LG · January 1, 2026

TL;DR

This paper introduces TMO, a local-cloud LLM inference system for multi-modal, multi-task, multi-dialogue applications. TMO decides whether a request is handled by a lightweight local LLM or offloaded to a large-scale cloud LLM, trading off response quality, latency, and usage cost.

Contribution

The paper proposes TMO, a novel local-cloud inference system with a reinforcement learning strategy and a new dataset for evaluating multi-modal LLM offloading.

Findings

1. TMO significantly reduces latency and cost compared to baselines.

2. TMO improves response quality in multi-modal, multi-task, multi-dialogue settings.

3. The RCRL strategy effectively balances resource use and performance.

Abstract

Compared to traditional machine learning models, recent large language models (LLMs) can exhibit multi-task-solving capabilities through multiple dialogues and multi-modal data sources. These unique characteristics of LLMs, together with their large model size, make their deployment more challenging. Specifically, (i) deploying LLMs on local devices faces computational, memory, and energy resource issues, while (ii) deploying them in the cloud cannot guarantee real-time service and incurs communication/usage costs. In this paper, we design TMO, a local-cloud LLM inference system with Three-M Offloading: Multi-modal, Multi-task, and Multi-dialogue. TMO incorporates (i) a lightweight local LLM that can process simple tasks at high speed and (ii) a large-scale cloud LLM that can handle multi-modal data sources. We develop a resource-constrained reinforcement learning (RCRL) strategy for…
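To make the trade-off in the abstract concrete, here is a minimal sketch of a budget-constrained offloading rule. It is not the paper's RCRL strategy: it is a plain greedy selection, and the action names (`local`, `cloud_text`, `cloud_multimodal`), the `Estimate` fields, the weights, and all numbers are hypothetical values chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    """Hypothetical per-action estimates; not taken from the paper."""
    quality: float  # expected response quality, in [0, 1]
    latency: float  # expected response time, in seconds
    cost: float     # expected usage cost, in arbitrary budget units


def score(e, w_quality=1.0, w_latency=0.1, w_cost=0.5):
    """Weighted trade-off: reward quality, penalize latency and cost.

    The weights are illustrative assumptions.
    """
    return w_quality * e.quality - w_latency * e.latency - w_cost * e.cost


def choose_action(estimates, budget_left):
    """Greedy rule: best-scoring action whose cost fits the remaining budget."""
    feasible = {a: e for a, e in estimates.items() if e.cost <= budget_left}
    if not feasible:
        raise ValueError("no action fits the remaining budget")
    return max(feasible, key=lambda a: score(feasible[a]))


# Made-up estimates for three hypothetical offloading choices.
ESTIMATES = {
    "local": Estimate(quality=0.5, latency=0.5, cost=0.0),
    "cloud_text": Estimate(quality=0.8, latency=2.0, cost=0.2),
    "cloud_multimodal": Estimate(quality=0.95, latency=3.0, cost=0.5),
}

if __name__ == "__main__":
    print(choose_action(ESTIMATES, budget_left=1.0))
    print(choose_action(ESTIMATES, budget_left=0.1))
```

With these made-up numbers, a generous budget selects the text-only cloud call, while a nearly exhausted budget falls back to the local model. A learned policy such as the one the paper describes would replace the fixed weights and estimates with values learned from interaction.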

Code & Models

Repositories

liangqiyuan/LCIO (PyTorch, official)

Taxonomy

Topics: Service-Oriented Architecture and Web Services · IoT and Edge/Fog Computing · Robotics and Automated Systems

Methods: travel james · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings