Y-MAP-Net: Real-time depth, normals, segmentation, multi-label captioning and 2D human pose in RGB images

Ammar Qammaz; Nikolaos Vasilikopoulos; Iason Oikonomidis; Antonis A. Argyros

arXiv:2411.10334·cs.CV·November 18, 2024

TL;DR

Y-MAP-Net is a lightweight, real-time neural network that simultaneously predicts depth, normals, human pose, segmentation, and captions from RGB images, leveraging multi-teacher training for efficient multi-task learning.

Contribution

It introduces a novel Y-shaped architecture trained via multi-teacher supervision to perform multiple tasks efficiently in real-time from a single RGB image.

Findings

01. Achieves real-time performance with multi-task predictions.

02. Demonstrates strong generalization and computational efficiency.

03. Supports practical robotics applications.

Abstract

We present Y-MAP-Net, a Y-shaped neural network architecture designed for real-time multi-task learning on RGB images. Y-MAP-Net simultaneously predicts depth, surface normals, human pose, and semantic segmentation, and generates multi-label captions, all from a single network evaluation. To achieve this, we adopt a multi-teacher, single-student training paradigm, where task-specific foundation models supervise the network's learning, enabling it to distill their capabilities into a lightweight architecture suitable for real-time applications. Y-MAP-Net exhibits strong generalization, simplicity and computational efficiency, making it ideal for robotics and other practical scenarios. To support future research, we will release our code publicly.
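The multi-teacher, single-student idea in the abstract can be illustrated with a small sketch. It is not the authors' training code: the abstract does not specify the loss functions, task weights, or output formats, so the per-task losses (L1 for dense maps, binary cross-entropy for multi-label outputs), the task names, and the helper `multi_teacher_loss` below are all hypothetical choices made for illustration. The sketch only shows the general pattern of summing one weighted loss per task, where each task's target comes from a different teacher model.

```python
import math
from typing import Callable, Dict, List

Vector = List[float]
LossFn = Callable[[Vector, Vector], float]


def l1_loss(pred: Vector, target: Vector) -> float:
    """Mean absolute error; an assumed choice for dense outputs such as depth."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def bce_loss(pred: Vector, target: Vector, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy; an assumed choice for multi-label outputs."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def multi_teacher_loss(
    student_out: Dict[str, Vector],
    teacher_out: Dict[str, Vector],
    losses: Dict[str, LossFn],
    weights: Dict[str, float],
) -> float:
    """Weighted sum of per-task losses.

    teacher_out[task] holds the pseudo-label produced by that task's
    teacher model; student_out[task] is the student's prediction for
    the same task from a single forward pass.
    """
    return sum(
        weights[task] * loss_fn(student_out[task], teacher_out[task])
        for task, loss_fn in losses.items()
    )


if __name__ == "__main__":
    # Toy values only; real outputs would be image-sized tensors.
    student = {"depth": [0.5, 1.0], "labels": [0.5]}
    teacher = {"depth": [1.0, 1.0], "labels": [1.0]}
    task_losses = {"depth": l1_loss, "labels": bce_loss}
    task_weights = {"depth": 1.0, "labels": 2.0}  # hypothetical weights
    print(multi_teacher_loss(student, teacher, task_losses, task_weights))
```

In a real setup the teachers would typically be run offline or under no-gradient mode, so that only the student's parameters are updated by this combined loss.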

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Human Pose and Action Recognition · Image and Object Detection Techniques

Methods: ADaptive gradient method with the OPTimal convergence rate