MTMamba++: Enhancing Multi-Task Dense Scene Understanding via Mamba-Based Decoders
Baijiong Lin, Weisen Jiang, Pengguang Chen, Shu Liu, and Ying-Cong Chen

TL;DR
MTMamba++ is a multi-task dense scene understanding architecture with Mamba-based decoders that capture long-range dependencies and cross-task interactions. It reports better results than CNN-based, Transformer-based, and diffusion-based methods on NYUDv2, PASCAL-Context, and Cityscapes.
Contribution
The paper presents MTMamba++, a new architecture with Mamba-based decoders that explicitly model long-range dependencies and cross-task interactions for improved multi-task scene understanding.
Findings
Outperforms CNN-based, Transformer-based, and diffusion-based methods on NYUDv2, PASCAL-Context, and Cityscapes.
Effectively models long-range dependencies using state-space models.
Maintains high computational efficiency while achieving superior accuracy.
Abstract
Multi-task dense scene understanding, which trains a model for multiple dense prediction tasks, has a wide range of application scenarios. Capturing long-range dependencies and enhancing cross-task interactions are crucial to multi-task dense prediction. In this paper, we propose MTMamba++, a novel architecture for multi-task scene understanding featuring a Mamba-based decoder. It contains two types of core blocks: the self-task Mamba (STM) block and the cross-task Mamba (CTM) block. STM handles long-range dependencies by leveraging state-space models, while CTM explicitly models task interactions to facilitate information exchange across tasks. We design two types of CTM blocks, namely F-CTM and S-CTM, to enhance cross-task interaction from feature and semantic perspectives, respectively. Extensive experiments on NYUDv2, PASCAL-Context, and Cityscapes datasets demonstrate the superior…
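To give a rough sense of why a state-space model can carry information across long sequences, here is a minimal sketch of a scalar linear state-space recurrence. It is not the STM block and not Mamba itself: Mamba uses a selective variant whose parameters depend on the input, and the function name `ssm_scan` and the constants `a`, `b`, `c` are assumptions chosen only for illustration.

```python
def ssm_scan(xs, a=0.9, b=1.0, c=1.0):
    """Run a scalar linear state-space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    The parameters a, b, c are illustrative constants, not values
    taken from the paper.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x  # the state accumulates a decayed history of inputs
        ys.append(c * h)
    return ys


# A single impulse at position 0, followed by zeros.
impulse = [1.0] + [0.0] * 9
outputs = ssm_scan(impulse)

# The impulse still contributes a ** k to the output k steps later,
# so early inputs influence distant positions through the hidden state.
for k, y in enumerate(outputs):
    print(k, round(y, 4))
```

The whole sequence is processed in one pass with a fixed-size state, which is why the cost grows linearly with sequence length.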
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications · Advanced Vision and Imaging
Methods: Mamba: Linear-Time Sequence Modeling with Selective State Spaces
