MTMamba: Enhancing Multi-Task Dense Scene Understanding by Mamba-Based Decoders
Baijiong Lin, Weisen Jiang, Pengguang Chen, Yu Zhang, Shu Liu, and Ying-Cong Chen

TL;DR
MTMamba introduces a Mamba-based architecture with specialized blocks to improve multi-task dense scene understanding by modeling long-range dependencies and cross-task interactions, leading to superior performance on benchmark datasets.
Contribution
The paper proposes MTMamba, a novel Mamba-based architecture with self-task and cross-task blocks for enhanced multi-task dense scene understanding.
Findings
Outperforms Transformer-based and CNN-based methods on NYUDv2 and PASCAL-Context.
On PASCAL-Context, achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in semantic segmentation, human parsing, and object boundary detection, respectively.
Demonstrates effective modeling of long-range dependencies and task interactions.
Abstract
Multi-task dense scene understanding, which learns a model for multiple dense prediction tasks, has a wide range of application scenarios. Modeling long-range dependency and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba, a novel Mamba-based architecture for multi-task scene understanding. It contains two types of core blocks: self-task Mamba (STM) block and cross-task Mamba (CTM) block. STM handles long-range dependency by leveraging Mamba, while CTM explicitly models task interactions to facilitate information exchange across tasks. Experiments on NYUDv2 and PASCAL-Context datasets demonstrate the superior performance of MTMamba over Transformer-based and CNN-based methods. Notably, on the PASCAL-Context dataset, MTMamba achieves improvements of +2.08, +5.01, and +4.90 over the previous best methods in the tasks of…
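The two ideas named in the abstract, carrying information across long sequences and exchanging information between tasks, can be illustrated with a toy sketch. This is not the paper's STM or CTM block. The function names (ssm_scan, cross_task_fuse), the fixed scalar parameters a, b, c, and the scalar gate are all assumptions made for illustration. Real Mamba layers use input-dependent (selective) parameters and operate on multi-channel 2D feature maps.

```python
from typing import Dict, List


def ssm_scan(x: List[float], a: float = 0.9, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Toy 1-D linear state-space recurrence (hypothetical, fixed parameters):
    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t.
    The hidden state h carries information from early inputs to late outputs."""
    h = 0.0
    ys: List[float] = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def cross_task_fuse(task_feats: Dict[str, List[float]], gate: float = 0.5) -> Dict[str, List[float]]:
    """Hypothetical cross-task exchange: average all task features into a shared
    feature, then blend it back into each task feature with a scalar gate."""
    n_tasks = len(task_feats)
    length = len(next(iter(task_feats.values())))
    shared = [sum(f[i] for f in task_feats.values()) / n_tasks for i in range(length)]
    return {
        name: [gate * s + (1.0 - gate) * t for t, s in zip(feat, shared)]
        for name, feat in task_feats.items()
    }


if __name__ == "__main__":
    # An impulse at position 0 still influences the output at the last position.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5))
    # Each task's features move toward the shared average.
    print(cross_task_fuse({"semseg": [1.0, 2.0], "boundary": [3.0, 4.0]}))
```

With a = 0.5, the scan maps the impulse [1.0, 0.0, 0.0] to [1.0, 0.5, 0.25], showing how a single early input keeps affecting later positions through the hidden state. The gate value controls how much of the shared signal each task receives; a learned, input-dependent gate would be a natural extension, but that is beyond what this sketch assumes.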
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Advanced Vision and Imaging
