Simplifying Multi-Task Architectures Through Task-Specific Normalization
Mihai Suteu, Ovidiu Serban

TL;DR
This paper demonstrates that task-specific normalization layers, especially the proposed TSσBN, can effectively address multi-task learning challenges, reducing complexity while maintaining or improving performance across various benchmarks.
Contribution
The paper introduces TSσBN, a lightweight task-specific normalization method that simplifies multi-task architectures and provides interpretability of task relationships.
Findings
TSσBN matches or exceeds state-of-the-art performance on multiple benchmarks.
Task-specific normalization reduces architectural complexity and overhead.
Learned gates in TSσBN offer insights into task capacity and filter specialization.
Abstract
Multi-task learning (MTL) aims to leverage shared knowledge across tasks to improve generalization and parameter efficiency, yet balancing resources and mitigating interference remain open challenges. Architectural solutions often introduce elaborate task-specific modules or routing schemes, increasing complexity and overhead. In this work, we show that normalization layers alone are sufficient to address many of these challenges. Simply replacing shared normalization with task-specific variants already yields competitive performance, questioning the need for complex designs. Building on this insight, we propose Task-Specific Sigmoid Batch Normalization (TSσBN), a lightweight mechanism that enables tasks to softly allocate network capacity while fully sharing feature extractors. TSσBN improves stability across CNNs and Transformers, matching or exceeding performance on…
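As a rough illustration of the mechanism the abstract describes, the following dependency-free Python sketch shows one way a task-specific, sigmoid-gated batch normalization could look. It is not the authors' implementation. The class name `TaskSpecificSigmoidBN`, the task names `seg` and `depth`, the per-task shift, and the choice to compute statistics from the current batch are all assumptions made for illustration. The sketch also assumes that the gate is the sigmoid of a learnable per-task, per-channel parameter, which keeps the scale bounded in (0, 1) while the input features stay shared.

```python
import math


def sigmoid(x):
    """Logistic function; maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


class TaskSpecificSigmoidBN:
    """Hypothetical sketch: batch norm over a (batch, channels) input where
    each task owns its own gate and shift, and the gate is sigmoid-bounded."""

    def __init__(self, num_channels, tasks, eps=1e-5):
        self.eps = eps
        # One raw (pre-sigmoid) gate and one shift per task and channel.
        # A raw gate of 0.0 corresponds to a gate value of 0.5.
        self.raw_gate = {t: [0.0] * num_channels for t in tasks}
        self.shift = {t: [0.0] * num_channels for t in tasks}

    def gates(self, task):
        """Bounded per-channel scales for one task."""
        return [sigmoid(g) for g in self.raw_gate[task]]

    def __call__(self, x, task):
        n, c = len(x), len(x[0])
        # Per-channel statistics of the current batch (assumption: no
        # running averages are tracked in this sketch).
        mean = [sum(row[j] for row in x) / n for j in range(c)]
        var = [sum((row[j] - mean[j]) ** 2 for row in x) / n for j in range(c)]
        gate, shift = self.gates(task), self.shift[task]
        return [
            [
                gate[j] * (row[j] - mean[j]) / math.sqrt(var[j] + self.eps)
                + shift[j]
                for j in range(c)
            ]
            for row in x
        ]


# Usage: the same shared features are normalized differently per task.
tasks = ["seg", "depth"]
bn = TaskSpecificSigmoidBN(num_channels=2, tasks=tasks)
bn.raw_gate["depth"] = [-10.0, 10.0]  # depth nearly switches off channel 0

x = [[1.0, 2.0], [3.0, 6.0]]  # shared features: 2 samples, 2 channels
out_seg = bn(x, "seg")
out_depth = bn(x, "depth")

# Stacking the gates gives a task-by-channel table of soft capacity use.
importance = {t: bn.gates(t) for t in tasks}
```

Reading the gates as a task-by-channel table is one plausible way to obtain the kind of task-filter importance view that the reviews below refer to. How the paper actually builds that view is not specified in this summary.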
Peer Reviews
Decision · Submitted to ICLR 2026
- The idea to make only normalization task-specific, without extra attention/routing, is simpler than works like Cross-Stitch, which add task branches or dynamic sharing, so using $\sigma$-BN from (Suteu & Guo, 2022) for MTL is a nice, underexplored reuse.
- The reviewer found it interesting that TS-$\sigma$-BN stays stable even when the authors boost BN learning rates to 10² (Figure 6), while plain TSBN collapses, so the bounded gate actually matters.
- Writing is mostly clear.
- It should have been easy to show TSσBN vs. MTAN vs. MoE on CelebA with 40 tasks under the LibMTL setup (the authors only report LibMTL for NYUv2/Cityscapes), and an ablation against simpler DSBN/TaskNorm baselines is missing.
- The reviewer questions the novelty of the contributions w.r.t. ICLR standards.
- Typos: Line 289: “respresentative” → “representative”
- The paper is clearly written and easy to understand.
- The paper addresses the persistent problem of negative interference in MTL. Finding efficient methods to mitigate this issue is of high importance to the field. The proposed method is simple, practical, and straightforward to integrate in practical settings.
- The ablations and analyses of task-filter importance (Figure 4) are intuitive and match expectations.
This paper suffers from significant limitations in its current form, primarily concerning novelty, the depth of its empirical evaluation, and its positioning within the current literature. This work is extremely incremental, and the core of the contributions around task-specific batch norms has been established for more than 5 years. **Limited Novelty:** The central contribution of this paper concerns using task-specific parameters or statistics in BatchNorm for MTL. Task-speci
- The capacity of TS$\sigma$BN to build an importance matrix is a highly appreciated feature of the method. It allows interpretable insights into the model behaviour.
- The robustness of TS$\sigma$BN to loss scales without additional changes is another strength.
- The idea is quite simple but is an important contribution to MTL. The concept of domain-specific batch layer norm, while not new for MTL, shows surprisingly positive effects. Besides that, TS$\sigma$BN is easy to implement and requires minimal cha
- While the authors focused quite extensively on vision tasks, it is unclear whether TS$\sigma$BN generalizes beyond computer vision. Small experiments on NLP or time series MTL (MIMIC-III) tasks would strengthen the paper's claim about the generalisation of the method.
- The paper provides mainly empirical evidence. A deeper theoretical justification of why task-specific batch normalization disentangles representations could make the work more compelling.
- Another aspect of MTL methods is its hyper-param
Taxonomy
Topics: Advanced Neural Network Applications · Software-Defined Networks and 5G · Advanced Graph Neural Networks
