HybridFlow: Resource-Adaptive Subtask Routing for Efficient Edge-Cloud LLM Inference
Jiangwen Dong, Jiayu Li, Tianhang Zheng, Wanyu Lin

TL;DR
HybridFlow is a resource-adaptive framework for edge-cloud LLM inference that routes each subtask to the edge or the cloud based on subtask dependencies and a learned utility model, reducing latency and cloud API usage.
Contribution
It introduces a dependency-aware DAG and a learned benefit-cost utility model for dynamic, parallel subtask execution and routing in edge-cloud LLM inference.
Findings
Reduces latency and cloud API usage in multiple benchmarks.
Maintains reasoning accuracy competitive with structured baselines.
Improves cost-accuracy trade-off in edge-cloud inference.
Abstract
Edge-cloud collaborative inference is becoming a practical necessity for LLM-powered edge devices: on-device models often cannot afford the required reasoning capability, while cloud-only inference could be prohibitively costly and slow under strict latency and token/API budgets. However, existing edge-cloud collaboration methods often route per query or at fixed steps, based simply on estimated difficulty. Such coarse and static heuristics overlook subtask dependencies, missing opportunities for parallel execution and budget-adaptive routing. To this end, we propose HybridFlow, a resource-adaptive edge-cloud inference framework that (i) builds a dependency-aware DAG for each query and executes newly unlocked subtasks in parallel, reducing end-to-end latency; (ii) routes each subtask online to the edge or cloud via a learned benefit-cost utility model that dynamically trades…
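The two mechanisms named in the abstract, unlocking subtasks as their dependencies finish and routing each one by a benefit-cost utility, can be illustrated with a small sketch. This is not the authors' implementation. The `Subtask` fields, the linear utility `est_gain - lam * est_cost`, the single scalar budget, and the wave-by-wave scheduling are all assumptions made for illustration; the paper's utility model is learned, whereas the scores here are hand-set numbers.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    deps: tuple = ()        # names of subtasks that must finish first
    est_gain: float = 0.0   # hypothetical: predicted accuracy benefit of cloud over edge
    est_cost: float = 0.0   # hypothetical: normalized cloud cost (tokens, latency)


def route(task, budget_left, lam=1.0):
    """Pick 'cloud' only if the assumed utility is positive and the budget allows it."""
    utility = task.est_gain - lam * task.est_cost
    if utility > 0 and task.est_cost <= budget_left:
        return "cloud"
    return "edge"


def schedule(tasks, budget, lam=1.0):
    """Group the DAG into waves; subtasks in one wave have no unmet dependencies
    and could therefore run in parallel. Returns a list of waves, each a list
    of (subtask name, target) pairs."""
    done = set()
    waves = []
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t.name not in done and all(d in done for d in t.deps)]
        if not ready:
            raise ValueError("cycle or missing dependency in subtask graph")
        wave = []
        for t in ready:
            target = route(t, budget, lam)
            if target == "cloud":
                budget -= t.est_cost  # charge the remaining cloud budget
            wave.append((t.name, target))
        done.update(t.name for t in ready)
        waves.append(wave)
    return waves


# Hypothetical query decomposed into four subtasks.
tasks = [
    Subtask("parse"),
    Subtask("solve_a", deps=("parse",), est_gain=0.6, est_cost=0.3),
    Subtask("solve_b", deps=("parse",), est_gain=0.1, est_cost=0.4),
    Subtask("merge", deps=("solve_a", "solve_b"), est_gain=0.9, est_cost=0.8),
]
waves = schedule(tasks, budget=1.0)
for i, wave in enumerate(waves):
    print(i, wave)
```

In this toy run, `solve_a` and `solve_b` land in the same wave because both depend only on `parse`. `solve_a` goes to the cloud, `solve_b` stays on the edge because its assumed cost exceeds its gain, and `merge` stays on the edge despite a positive utility because the remaining budget no longer covers its cost. The sketch only plans the waves; it does not actually execute subtasks concurrently.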
Taxonomy
Topics: IoT and Edge/Fog Computing · Big Data and Digital Economy · Advanced Neural Network Applications
