SegRet: An Efficient Design for Semantic Segmentation with Retentive Network
Zhiyuan Li, Yi Chang, Yuan Wu

TL;DR
SegRet is a lightweight semantic segmentation model, aimed at applications such as autonomous driving, that pairs a Retentive Network (RetNet) backbone with a residual decoder and reports state-of-the-art results with fewer parameters.
Contribution
The paper presents SegRet, a semantic segmentation model that combines a RetNet backbone with a zero-initialized residual decoder for improved efficiency and performance.
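For context on the RetNet backbone, the following is a minimal NumPy sketch of single-head retention in its parallel form, where the score matrix is weighted by a distance-based decay mask. The causal rule (decay `gamma ** (i - j)` for `i >= j`, zero otherwise) follows the original RetNet formulation. The symmetric variant `gamma ** |i - j|` is an assumption about what a bidirectional adaptation for images could look like. The names `decay_mask` and `retention` are illustrative, and none of this is taken from the SegRet code.

```python
import numpy as np

def decay_mask(n, gamma, bidirectional=False):
    """Return an (n, n) decay mask over token positions."""
    idx = np.arange(n)
    diff = idx[:, None] - idx[None, :]  # diff[i, j] = i - j
    if bidirectional:
        # Assumed symmetric variant: decay depends only on |i - j|.
        return gamma ** np.abs(diff)
    # Causal RetNet mask: gamma^(i - j) when i >= j, otherwise 0.
    return np.where(diff >= 0, gamma ** np.abs(diff), 0.0)

def retention(q, k, v, gamma, bidirectional=False):
    """Parallel-form retention: (Q K^T, elementwise-weighted by the mask) V."""
    scores = q @ k.T
    return (scores * decay_mask(len(q), gamma, bidirectional)) @ v

# Toy usage: 4 tokens with 8-dimensional features (sizes are arbitrary).
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 8)) for _ in range(3))
out = retention(q, k, v, gamma=0.9, bidirectional=True)
```

In the causal mask, a token only receives contributions from earlier positions. The symmetric mask lets every token see both directions while still down-weighting distant ones, which is the property an image backbone needs.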
Findings
Achieves state-of-the-art performance on ADE20K, Cityscapes, and COCO-Stuff.
Reduces model parameters significantly compared to existing methods.
Maintains high accuracy with lower computational cost.
Abstract
With the rapid evolution of autonomous driving technology and intelligent transportation systems, semantic segmentation has become increasingly critical. Precise interpretation and analysis of real-world environments are indispensable for these advanced applications. However, traditional semantic segmentation approaches frequently face challenges in balancing model performance with computational efficiency, especially regarding the volume of model parameters. To address these constraints, we propose SegRet, a novel model employing the Retentive Network (RetNet) architecture coupled with a lightweight residual decoder that integrates zero-initialization. SegRet offers three distinctive advantages: (1) Lightweight Residual Decoder: by embedding a zero-initialization layer within the residual network structure, the decoder remains computationally streamlined without sacrificing essential…
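The decoder idea can be sketched in a few lines of NumPy. The sketch assumes the structure one reviewer summarizes below: unify channels per stage, add a zero-initialized 1×1 residual branch, then upsample, concatenate, and fuse. The class name `ZeroInitResidualDecoder`, the stage widths, the nearest-neighbor upsampling, and the 19-class output are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, weight):
    """1x1 convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample_nearest(x, factor):
    """Nearest-neighbor upsampling by an integer factor (a simplification)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

class ZeroInitResidualDecoder:
    def __init__(self, in_channels, embed_dim, num_classes):
        # One channel-unifying 1x1 projection per backbone stage.
        self.unify = [rng.normal(0, 0.02, (embed_dim, c)) for c in in_channels]
        # Residual 1x1 branches start at zero, so each one is a no-op at init.
        self.residual = [np.zeros((embed_dim, embed_dim)) for _ in in_channels]
        # Final 1x1 fusion over the concatenated stages.
        self.fuse = rng.normal(0, 0.02, (num_classes, embed_dim * len(in_channels)))

    def refine(self, feats):
        out = []
        for x, w_u, w_r in zip(feats, self.unify, self.residual):
            u = conv1x1(x, w_u)
            out.append(u + conv1x1(u, w_r))  # residual add
        return out

    def __call__(self, feats):
        refined = self.refine(feats)
        target_h = refined[0].shape[1]  # resolution of the first (largest) stage
        up = [upsample_nearest(r, target_h // r.shape[1]) for r in refined]
        return conv1x1(np.concatenate(up, axis=0), self.fuse)

# Toy usage: four stages with hypothetical widths and halving resolutions.
channels, sizes = (64, 128, 256, 512), (16, 8, 4, 2)
feats = [rng.normal(size=(c, s, s)) for c, s in zip(channels, sizes)]
decoder = ZeroInitResidualDecoder(channels, embed_dim=32, num_classes=19)
logits = decoder(feats)  # shape: (19, 16, 16)
```

Because the residual weights start at zero, each refined map equals its channel-unified input at initialization. The branch can only depart from the identity as training updates it, which is the usual motivation for zero-initialized residual paths.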
Peer Reviews
Decision: Submitted to ICLR 2026
1. This paper achieves a strong efficiency trade-off, particularly at smaller model sizes, with documented improvements over comparable baseline methods.
2. This paper provides transparent ablation studies on decoder design and input scaling, along with clear training details and code to ensure reproducibility.
3. This paper honestly addresses limitations and suggests plausible next steps, such as domain adaptation and attention-guided upsampling.
1. The novelty lies mainly in the minimalist decoder design; the use of RetNet as the encoder and residual fusion represents an incremental improvement rather than a conceptual breakthrough.
2. While results are competitive, they do not clearly establish state-of-the-art performance under similar computational constraints in the most challenging benchmarks. Stronger comparisons using identical training configurations would be beneficial.
The paper is well organized. The proposed method achieves SOTA performance with low computational resources.
1. Lack of contribution. The proposed method's backbone is almost the same as that of the compared method RMT, so it cannot be viewed as a contribution. The improvement in the decoder is also minimal.
2. The compared methods are mostly general vision backbones for which results on many other vision tasks, such as object classification and detection, are available, but the proposed method is not compared with them on those tasks.
3. The comparisons in Tab. 2, 3, 6, 7, 8 are not fair. The image size should keep the sa
Clean encoder–decoder story using Vision RetNet as a hierarchical backbone (Fig. 2, p. 4; Sec. 3.1) and a small decoder (Sec. 3.2, p. 5–6). Competitive tiny/small regime: SegRet-Tiny (≈14 M params) is strong vs. other “tiny” setups across datasets (Tables 1–3, pp. 7–8). Readable presentation and sensible training protocol (Sec. 4.1, p. 7), with an (anonymous) code link in the Reproducibility Statement (p. 10). The principal strength of SegRet lies in its computational efficiency and architect
Vision RetNet is adopted largely as-is; the paper does not contribute new retention variants for vision, nor new theory atop RetNet. The retention machinery and bidirectional vision adaptation (BiRetention, horizontal/vertical decomposition) are recaps of prior work (Sec. 3.1; Eqs. 6–14). The “zero-initialized residual decoder” amounts to a zero-init 1×1 residual branch after channel unification, followed by standard upsample-concat-conv (Eqs. 15–18). This design is extremely close to well-know
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Robotics and Automated Systems · Graph Theory and Algorithms
