DenseMLLM: Standard Multimodal LLMs are Intrinsic Dense Predictors

Yi Li; Hongze Shen; Lexiang Tang; Xin Li; Xinpeng Ding; Yinsong Liu; Deqiang Jiang; Xing Sun; Xiaomeng Li

arXiv:2602.14134·cs.CV·February 17, 2026

Open Access

TL;DR

DenseMLLM demonstrates that standard multimodal large language models can be adapted for dense prediction tasks without additional decoders, maintaining high performance across various benchmarks.

Contribution

This work introduces DenseMLLM, a minimalist approach enabling standard MLLMs to perform dense predictions through a novel supervision strategy, eliminating the need for task-specific decoders.

Findings

01. Achieves competitive results on dense prediction benchmarks.

02. Supports multiple dense perception tasks without architectural modifications.

03. Maintains generalist design while handling fine-grained visual tasks.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a…

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis