Zeros can be Informative: Masked Binary U-Net for Image Segmentation on Tensor Cores

Chunshu Wu; Ruibing Song; Sushant Kondguli; Tong Geng; Ang Li

arXiv:2601.11660·cs.CV·January 21, 2026

Open Access 3 Reviews

TL;DR

This paper introduces Masked Binary U-Net, a hardware-efficient binary neural network for real-time high-resolution image segmentation that maintains near full-precision accuracy while significantly improving speed and energy efficiency on GPUs.

Contribution

The paper proposes a novel masked binary U-Net architecture and a GPU implementation that together achieve high accuracy and efficiency for real-time image segmentation on resource-constrained devices.

Findings

01. Near full-precision accuracy, with only a 3% average accuracy drop.
02. Over 2x speedup compared to a 16-bit floating-point U-Net.
03. More than 3x energy reduction on GPUs.

Abstract

Real-time image segmentation is a key enabler for AR/VR, robotics, drones, and autonomous systems, where tight accuracy, latency, and energy budgets must be met on resource-constrained edge devices. While U-Net offers a favorable balance of accuracy and efficiency compared to large transformer-based models, achieving real-time performance on high-resolution input remains challenging due to compute, memory, and power limits. Extreme quantization, particularly binary networks, is appealing for its hardware-friendly operations. However, two obstacles limit practicality: (1) severe accuracy degradation, and (2) a lack of end-to-end implementations that deliver efficiency on general-purpose GPUs. We make two empirical observations that guide our design. (1) An explicit zero state is essential: training with zero masking to binary U-Net weights yields noticeable sparsity. (2) Quantization…
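The first observation in the abstract, that an explicit zero state matters, can be illustrated with a minimal sketch. It assumes a simple magnitude threshold decides which binary weights are masked to zero; the function name `masked_binarize` and the `threshold` parameter are hypothetical and are not taken from the paper, whose actual masking rule is learned during training and is not described in this excerpt.

```python
import numpy as np

def masked_binarize(w, threshold):
    """Map real-valued weights to {-1, 0, +1}.

    ASSUMPTION: a fixed magnitude threshold selects the zeros. This is an
    illustrative stand-in, not the paper's training procedure.
    """
    sign = np.where(w >= 0, 1, -1).astype(np.int8)   # pure binary weights in {-1, +1}
    mask = (np.abs(w) > threshold).astype(np.int8)   # 1 = keep, 0 = explicit zero state
    return sign * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))                 # stand-in for latent full-precision weights
w_q, mask = masked_binarize(w, threshold=0.5)
sparsity = 1.0 - mask.mean()                # fraction of weights forced to zero
print(w_q)
print(f"sparsity: {sparsity:.2f}")
```

The masked weights are the elementwise product of a sign tensor and a 0/1 mask, so the binary structure is preserved everywhere the mask is 1, and the sparsity is simply the fraction of zeros in the mask.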

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

* This paper is well organized and easy to follow, especially for readers who are not familiar with this area.
* The insight that introducing a large amount of ‘zero state’ into pure binary U-Nets is extremely helpful for segmentation is valuable.
* A major strength of this work is its practical GPU execution framework, which provides tangible, measurable speedup on widely available NVIDIA GPUs.
* Innovatively unlocks hardware potential with “Subtractive Bit-Encoding”, extending the BMMA (Binary m…

Weaknesses

* Although introducing a ‘zero state’ into a pure binary U-Net can significantly boost segmentation performance, how the method performs on other types of networks and tasks remains unclear. However, the paper’s innovation in the low-level hardware implementation is still very solid.
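The reviewer's point about binary matrix multiplication can be made concrete with a small arithmetic sketch. With +1 encoded as bit 1 and -1 as bit 0, a dot product between a binary activation vector and a masked binary weight vector needs only XOR, AND, and popcount. This is a generic identity shown under that encoding assumption; it is not the paper's “Subtractive Bit-Encoding”, whose details are not given in this excerpt, and the helper names `pack` and `masked_binary_dot` are hypothetical.

```python
def pack(bits):
    """Pack a list of 0/1 values into a Python int (bit i holds bits[i])."""
    out = 0
    for i, b in enumerate(bits):
        out |= (b & 1) << i
    return out

def masked_binary_dot(x_bits, b_bits, m_bits):
    """Dot product of x in {-1,+1}^n with w = m * b, b in {-1,+1}^n, m in {0,1}^n.

    ASSUMED encoding: bit 1 means +1, bit 0 means -1. A product x_i * b_i is -1
    exactly where the bits differ, so
        dot = popcount(m) - 2 * popcount(m & (x ^ b)).
    """
    disagree = m_bits & (x_bits ^ b_bits)          # unmasked positions with opposite signs
    return bin(m_bits).count("1") - 2 * bin(disagree).count("1")

x = [1, 0, 1, 1, 0, 0, 1, 0]   # activations as bits
b = [1, 1, 0, 1, 0, 1, 1, 0]   # weight signs as bits
m = [1, 1, 1, 0, 0, 1, 1, 0]   # mask: 0 marks an explicit zero weight

# Reference computed in plain integer arithmetic on {-1, 0, +1} values.
reference = sum(mi * (2 * bi - 1) * (2 * xi - 1) for xi, bi, mi in zip(x, b, m))
result = masked_binary_dot(pack(x), pack(b), pack(m))
print(result, reference)
```

The mask only adds one AND per word on top of the usual XOR-popcount pattern, which is one way to see why a zero state can be cheap on bitwise hardware.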

Reviewer 02 · Rating 2 · Confidence 3

Strengths

S1 - Interesting finding that including zero-mask values seems to preserve the performance of the UNet model. Moreover, an interesting tensor-core deployment scheme is shown.
S2 - The proposed system performs roughly as well as the full-precision UNet while being much faster.

Weaknesses

W1 - While INT8 and INT4 models are evaluated for performance (in 4.3), their latency / speed / energy is not shown. Could it be that INT8 and INT4 perform just as well as the proposed method in terms of speed?
W2 - The authors should have compared against another simple baseline, namely using TensorRT for inference optimization / quantization as in https://arxiv.org/pdf/2012.12259.
W3 - The paper is lacking in experimental results. For example, Section 4.4 summarizes insights already gathered in 4.3 an…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

1. The paper addresses real-time, high-resolution segmentation on edge devices, focusing on accuracy, latency, and energy. The empirical insights are straightforward and motivate a practical design, making the contribution both coherent and relevant.
2. The work combines masked binary weights with a cost-aware layer selection and a GPU execution framework using Tensor Cores. The subtractive bit-encoding and native binary operations show strong engineering rigor and enable deployable efficiency g…

Weaknesses

1. The method adds ternary weights to selected layers of a binary U-Net to approach full-precision accuracy with near-binary efficiency. However, the paper does not clearly compare against ternary quantization baselines. Could the authors clarify in which dimensions MBU-Net outperforms classical ternary methods?
2. The paper enhances binary networks by masking some weights to zero. Intuitively, this seems related to sparsity/pruning. Could the authors clarify: can this method be considered as a…

Taxonomy

Topics: Advanced Neural Network Applications · Stochastic Gradient Optimization Techniques · Parallel Computing and Optimization Techniques