M2H: Multi-Task Learning with Efficient Window-Based Cross-Task Attention for Monocular Spatial Perception
U.V.B.L Udugama, George Vosselman, Francesco Nex

TL;DR
M2H is an efficient multi-task learning framework using window-based cross-task attention for monocular spatial perception, enabling real-time multi-task predictions on edge devices.
Contribution
Introduces M2H, a novel multi-task learning model with structured feature exchange via window-based attention, optimized for real-time monocular spatial perception.
Findings
Outperforms state-of-the-art multi-task models on NYUDv2.
Surpasses single-task baselines on Hypersim.
Achieves superior performance on Cityscapes while remaining computationally efficient.
Abstract
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. This paper introduces Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the foundation for monocular spatial perception systems supporting 3D scene graph construction in dynamic…
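To make the idea of window-based cross-task attention more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`window_partition`, `window_merge`, `cross_task_window_attention`), the single-head formulation, the residual connection, and the projection matrices `Wq`, `Wk`, `Wv` are all assumptions made for illustration. The sketch only shows the general pattern the abstract describes: features of one task (the target) attend to features of another task (the source), with attention restricted to non-overlapping local windows so the cost stays bounded and the target keeps its own features.

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, win * win, C).
    """
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    x = x.reshape(H // win, win, W // win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def window_merge(windows, H, W, win):
    """Inverse of window_partition: rebuild the (H, W, C) feature map."""
    C = windows.shape[-1]
    x = windows.reshape(H // win, W // win, win, win, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_task_window_attention(target, source, win, Wq, Wk, Wv):
    """Hypothetical single-head cross-task attention within local windows.

    target, source: (H, W, C) feature maps from two different task branches.
    Wq, Wk: (C, D) projections; Wv: (C, C) so the output matches target.
    Queries come from the target task, keys and values from the source task.
    """
    H, W, _ = target.shape
    q = window_partition(target, win) @ Wq           # (N, T, D)
    k = window_partition(source, win) @ Wk           # (N, T, D)
    v = window_partition(source, win) @ Wv           # (N, T, C)
    scale = np.sqrt(Wq.shape[1])
    attn = softmax(q @ k.transpose(0, 2, 1) / scale)  # (N, T, T), per window
    exchanged = window_merge(attn @ v, H, W, win)
    # Residual connection (an assumption): the target task keeps its own
    # features and only adds what it gathered from the source task.
    return target + exchanged

# Usage with random stand-in features (for example, depth attending to segmentation).
rng = np.random.default_rng(0)
depth_feat = rng.normal(size=(8, 8, 16))
seg_feat = rng.normal(size=(8, 8, 16))
Wq, Wk, Wv = (rng.normal(size=(16, 16)) * 0.1 for _ in range(3))
refined_depth = cross_task_window_attention(depth_feat, seg_feat, 4, Wq, Wk, Wv)
```

Because each token attends only to the `win * win` tokens in its own window, the attention cost grows linearly with the number of windows rather than quadratically with the full image size, which is the usual motivation for windowed attention in efficiency-oriented models.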
Taxonomy
Topics: Advanced Neural Network Applications · Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications
