Multi-task Learning with 3D-Aware Regularization

Wei-Hong Li; Steven McDonagh; Ales Leonardis; Hakan Bilen

arXiv:2310.00986 · cs.CV · October 3, 2023

TL;DR

This paper introduces a 3D-aware regularizer for multi-task learning in computer vision, improving task correlation modeling by projecting features into a shared 3D space and enhancing performance across multiple benchmarks.

Contribution

The paper proposes a novel 3D-aware regularizer that interfaces multiple tasks via a shared 3D feature space, improving multi-task learning performance.

Findings

01. Improves multi-task learning performance on NYUv2 and PASCAL-Context.

02. Architecture-agnostic method adaptable to various backbones.

03. Enhances correlation modeling between tasks through 3D regularization.

Abstract

Deep neural networks have become a standard building block for designing models that can perform multiple dense computer vision tasks such as depth estimation and semantic segmentation thanks to their ability to capture complex correlations in high dimensional feature space across tasks. However, the cross-task correlations that are learned in the unstructured feature space can be extremely noisy and susceptible to overfitting, consequently hurting performance. We propose to address this problem by introducing a structured 3D-aware regularizer which interfaces multiple tasks through the projection of features extracted from an image encoder to a shared 3D feature space and decodes them into their task output space through differentiable rendering. We show that the proposed method is architecture agnostic and can be plugged into various prior multi-task backbones to improve their…
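
The "decodes them into their task output space through differentiable rendering" step can be illustrated with a minimal NumPy sketch of NeRF-style alpha compositing along a single ray. This is an illustration under stated assumptions, not the authors' implementation: the function name `composite_along_ray`, the number of samples, and the toy values are all hypothetical, and the paper's actual renderer may differ.

```python
import numpy as np

def composite_along_ray(densities, features, deltas):
    """Alpha-composite per-sample features along one ray (NeRF-style quadrature).

    densities: (S,) non-negative density at each of S samples along the ray
    features:  (S, C) feature vector at each sample
    deltas:    (S,) length of the ray segment that each sample represents
    Returns the rendered (C,) feature and the (S,) compositing weights.
    """
    # Opacity contributed by each segment.
    alphas = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of the ray that reaches sample i unabsorbed.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = transmittance * alphas
    rendered = weights @ features
    return rendered, weights

# Toy ray (hypothetical values): empty space, thin haze, a dense surface, then more matter.
densities = np.array([0.0, 0.5, 50.0, 1.0])
deltas = np.full(4, 0.25)
features = np.eye(4)  # one-hot features, so the rendered feature equals the weights
sample_depths = np.array([0.25, 0.5, 0.75, 1.0])

rendered, weights = composite_along_ray(densities, features, deltas)
expected_depth = weights @ sample_depths  # depth rendered with the same weights
```

Every operation in the sketch is differentiable, so a loss on the rendered output can propagate back to whatever produced the densities and features. Reusing the same weights for depth and for other per-sample features is one way such a renderer ties task outputs to a shared geometry.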

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 · marginally above the acceptance threshold · Confidence 3

Strengths

(1) Interesting idea: the auxiliary 3D-aware branch forces the backbone to be truly 3D-aware. The method also improves the alignment between the different prediction heads and the depth/density head, since all of them are rendered following the physical imaging process. (2) Using the 3D-aware representation as an auxiliary branch only for regularization also enables fast inference, which is a new and interesting idea to me.

Weaknesses

(1) The 3D-aware branch uses a tri-plane representation and volumetric rendering, which require a specific camera model (intrinsic matrix). The model may therefore overfit to specific camera parameters. By comparison, a pixel-aligned scene representation (e.g., PIFu) can avoid this problem, and it uses an NDC representation. For scenes with different camera parameters, the performance is unclear. (2) Although the tri-plane representation has significantly reduced the memory co
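
The camera-model dependence raised in point (1) can be seen in how rays are generated for volumetric rendering. The following NumPy sketch back-projects pixel centers through a pinhole intrinsic matrix; it is an assumption-laden illustration, not the paper's code. The helpers `intrinsics` and `pixel_rays`, the 5×5 image size, and the focal lengths are hypothetical.

```python
import numpy as np

def intrinsics(focal, cx, cy):
    """Pinhole intrinsic matrix with square pixels and no skew (an assumption)."""
    return np.array([[focal, 0.0, cx],
                     [0.0, focal, cy],
                     [0.0, 0.0, 1.0]])

def pixel_rays(K, height, width):
    """Unit ray directions in camera coordinates for every pixel center."""
    v, u = np.meshgrid(np.arange(height) + 0.5,
                       np.arange(width) + 0.5, indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3) homogeneous
    dirs = pixels @ np.linalg.inv(K).T                   # back-project with K^-1
    return dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

# Same 5x5 image, two hypothetical focal lengths (in pixels).
rays_wide = pixel_rays(intrinsics(4.0, 2.5, 2.5), 5, 5)
rays_tele = pixel_rays(intrinsics(8.0, 2.5, 2.5), 5, 5)
```

The same pixel maps to a different 3D ray when the focal length changes, so a renderer trained with one intrinsic matrix samples the 3D feature space differently from one trained with another.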

Reviewer 02 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

Overall, I like this idea very much. To me, this work opens a new research line that makes most recognition methods 3D-aware, which is the dual of another research line that makes 2D image generation tasks 3D-aware. Although the introduced 3D-aware encoder does not explicitly produce a 3D representation (that is, the method cannot output a decent 3D mesh), it opens the way to investigating 3D properties in most recognition tasks.

Weaknesses

I had high expectations for the experiments after reading the Abstract and Introduction sections, but the experiments were not strong enough to meet them.

Reviewer 03 · Rating 6 · marginally above the acceptance threshold · Confidence 4

Strengths

The paper is quite well-written and easy to follow. The main idea of the paper is clear: output a 3D feature representation from an image (using K-planes), which is then used in a NeRF-like neural rendering pipeline to project the features into a 2D feature image that a task-specific decoder consumes.
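
The plane-based 3D feature representation described above can be sketched as a lookup that projects a 3D point onto three axis-aligned feature planes and interpolates each one. This is a minimal NumPy sketch under stated assumptions rather than the paper's code: the names `bilinear` and `sample_triplane`, the 4×4 plane resolution, and the 2-channel features are hypothetical, and the three lookups are summed here, whereas K-planes combines plane features by element-wise multiplication.

```python
import numpy as np

def bilinear(plane, u, v):
    """Bilinearly sample a (R, R, C) plane at continuous coords u, v in [-1, 1]."""
    res = plane.shape[0]
    # Map [-1, 1] to continuous grid indices in [0, res - 1].
    x = (u + 1.0) * 0.5 * (res - 1)
    y = (v + 1.0) * 0.5 * (res - 1)
    x0 = int(np.clip(np.floor(x), 0, res - 2))
    y0 = int(np.clip(np.floor(y), 0, res - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[y0, x0] + tx * plane[y0, x0 + 1]
    bottom = (1 - tx) * plane[y0 + 1, x0] + tx * plane[y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom

def sample_triplane(planes, point):
    """Feature of a 3D point: sum of its projections onto the XY, XZ, YZ planes."""
    x, y, z = point
    return (bilinear(planes["xy"], x, y)
            + bilinear(planes["xz"], x, z)
            + bilinear(planes["yz"], y, z))

# Hypothetical planes: random stand-ins for what an image encoder would predict.
rng = np.random.default_rng(0)
planes = {name: rng.normal(size=(4, 4, 2)) for name in ("xy", "xz", "yz")}

corner_feature = sample_triplane(planes, (-1.0, -1.0, -1.0))
interior_feature = sample_triplane(planes, (0.2, -0.4, 0.7))
```

Features sampled this way at points along a camera ray are what a NeRF-like renderer would composite into the 2D feature image passed to the task-specific decoder.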

Weaknesses

Predicting 3D representations (to use in neural rendering) from unlabeled 2D images and/or text is a relatively common idea [1, 2, 3, 4, 5]. The paper applies this idea in a multi-task setting and adds multi-view consistency for multi-view datasets. The results are not highly compelling compared to the baselines in the paper, which may internally learn some 3D structure too. So the explicit formulation of the 3D latent structure is not well motivated, unless the task requires some view editing

Code & Models

Repositories

vico-uoe/mtpsl · PyTorch · Official

Videos

Multi-task Learning with 3D-Aware Regularization · SlidesLive

Taxonomy

Topics: Advanced Vision and Imaging · Human Pose and Action Recognition · Advanced Neural Network Applications