Y-MAP-Net: Real-time depth, normals, segmentation, multi-label captioning and 2D human pose in RGB images
Ammar Qammaz, Nikolaos Vasilikopoulos, Iason Oikonomidis, Antonis A. Argyros

TL;DR
Y-MAP-Net is a lightweight, real-time neural network that simultaneously predicts depth, normals, human pose, segmentation, and captions from RGB images, leveraging multi-teacher training for efficient multi-task learning.
Contribution
It introduces a novel Y-shaped architecture trained via multi-teacher supervision to perform multiple tasks efficiently in real-time from a single RGB image.
Findings
Achieves real-time performance with multi-task predictions.
Demonstrates strong generalization and computational efficiency.
Supports practical robotics applications.
Abstract
We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions, all from a single network evaluation. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the network's learning, enabling it to distill their capabilities into a lightweight architecture suitable for real-time applications. Y-MAP-Net exhibits strong generalization, simplicity and computational efficiency, making it ideal for robotics and other practical scenarios. To support future research, we will release our code publicly.
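The multi-teacher, single-student idea in the abstract can be illustrated with a small sketch of a combined training objective. This is not the authors' implementation: the task names, the choice of mean squared error and binary cross-entropy, the unit weights, and the toy values are all assumptions made for illustration. Each teacher is assumed to have already produced a target for its own task, and the student's per-task outputs are compared against those targets and summed into one scalar loss.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
LossFn = Callable[[Vector, Vector], float]


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error, a common choice for dense regression outputs."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(pred: Vector, target: Vector, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy, a common choice for multi-label outputs."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def multi_teacher_loss(
    student: Dict[str, Vector],
    teachers: Dict[str, Vector],
    loss_fns: Dict[str, LossFn],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses between student and teacher outputs."""
    return sum(
        weights[task] * loss_fns[task](student[task], teachers[task])
        for task in loss_fns
    )


# Hypothetical toy values: two "pixels" of depth and two caption tags.
student_out = {"depth": [0.5, 0.5], "tags": [0.5, 0.5]}
teacher_out = {"depth": [1.0, 0.0], "tags": [1.0, 0.0]}
loss_fns = {"depth": mse, "tags": binary_cross_entropy}  # assumed pairing
weights = {"depth": 1.0, "tags": 1.0}                    # assumed weights

total = multi_teacher_loss(student_out, teacher_out, loss_fns, weights)
print(f"combined distillation loss: {total:.4f}")
```

In a real training loop the student would be a single network with one head per task, and the per-task weights would typically need tuning so that no single teacher dominates the gradient.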
Taxonomy
Topics: Industrial Vision Systems and Defect Detection · Human Pose and Action Recognition · Image and Object Detection Techniques
Methods: ADaptive gradient method with the OPTimal convergence rate
