TCM-Serve: Modality-aware Scheduling for Multimodal Large Language Model Inference

Konstantinos Papaioannou; Thaleia Dimitra Doudali

arXiv:2603.26498·cs.DC·May 6, 2026

TL;DR

This paper introduces TCM-Serve, a modality-aware scheduler for multimodal large language model (MLLM) inference. It classifies requests by type and prioritizes them accordingly, reducing time-to-first-token by 54% overall and latency for critical requests by 78.5%.

Contribution

The paper presents TCM-Serve, a scheduling system that dynamically classifies multimodal requests (text, image, video) and prioritizes them so that small requests are not blocked behind large ones, while large requests are not starved.

Findings

01. Reduces time-to-first-token by 54% overall.

02. Achieves a 78.5% reduction in latency for critical requests.

03. Keeps multimodal LLM serving interactively responsive while avoiding starvation of large requests.

Abstract

Multimodal Large Language Models (MLLMs) power platforms like ChatGPT, Gemini, and Copilot, enabling richer interactions with text, images, and videos. These heterogeneous workloads introduce additional inference stages, such as vision preprocessing and encoding, that inflate latency and memory demand. Existing LLM serving systems, optimized for text-only workloads, fail under multimodality: large requests (e.g., videos) monopolize resources, causing severe head-of-line blocking and performance degradation. Our key insight is that multimodal requests differ by orders of magnitude in resource demands, which we capture through a simple abstraction: videos behave like trucks, images like cars, and text like motorcycles. We design TCM-Serve, a modality-aware scheduler that lets motorcycles flow quickly through cars and trucks, ensuring interactive responsiveness while avoiding starvation.…
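The abstract does not spell out the scheduling algorithm, so the following is only a minimal sketch of the general idea, not TCM-Serve's actual design. It assumes a fixed modality-to-class mapping (text, image, video to motorcycle, car, truck) and a simple aging rule: each class gets a hypothetical "slack" in seconds, and requests are served in order of arrival time plus slack. Small requests can then overtake larger ones that arrived shortly before them, but a large request that has waited longer than its slack goes first, which prevents starvation. The names `SLACK_SECONDS`, `Request`, and `ModalityAwareQueue` and all numeric values are illustrative assumptions.

```python
import heapq
import itertools
from dataclasses import dataclass
from typing import Optional

# Vehicle classes from the abstract's analogy.
VEHICLE_CLASS = {"text": "motorcycle", "image": "car", "video": "truck"}

# Hypothetical slack per class, in seconds (assumed values, not from the paper).
# A larger slack means the request may be overtaken by later, lighter requests.
SLACK_SECONDS = {"motorcycle": 0.0, "car": 5.0, "truck": 10.0}


@dataclass
class Request:
    req_id: str
    modality: str   # "text", "image", or "video"
    arrival: float  # arrival time in seconds


class ModalityAwareQueue:
    """Serve requests in order of (arrival + class slack).

    This is equivalent to a priority queue with linear aging: waiting
    time steadily erodes the advantage of lighter classes.
    """

    def __init__(self) -> None:
        self._heap = []
        self._seq = itertools.count()  # tie-breaker keeps ordering stable

    def submit(self, req: Request) -> None:
        vehicle = VEHICLE_CLASS[req.modality]
        key = req.arrival + SLACK_SECONDS[vehicle]
        heapq.heappush(self._heap, (key, next(self._seq), req))

    def pop(self) -> Optional[Request]:
        """Return the next request to run, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    q = ModalityAwareQueue()
    q.submit(Request("video-1", "video", arrival=0.0))
    q.submit(Request("image-1", "image", arrival=0.5))
    q.submit(Request("text-1", "text", arrival=1.0))
    while (req := q.pop()) is not None:
        print(req.req_id)
```

In this sketch the text request is served before the image and video requests even though it arrived last. A video that has already waited longer than its 10-second slack would be served before a newly arrived text request. A real serving system would also have to account for batching, memory pressure, and the cost of vision preprocessing and encoding, which this toy queue ignores.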
