Sparse Refinement for Efficient High-Resolution Semantic Segmentation
Zhijian Liu, Zhuoyang Zhang, Samir Khaki, Shang Yang, Haotian Tang, Chenfeng Xu, Kurt Keutzer, Song Han

TL;DR
SparseRefine is a novel method that efficiently enhances low-resolution semantic segmentation with sparse high-resolution refinements, enabling faster processing of high-res images with minimal accuracy loss.
Contribution
It introduces a universal sparse refinement framework that improves high-resolution semantic segmentation efficiency across various models.
Findings
Achieves 1.5 to 3.7 times speedup on multiple models.
Maintains accuracy with negligible to no loss.
Applicable to CNN- and ViT-based models.
Abstract
Semantic segmentation empowers numerous real-world applications, such as autonomous driving and augmented/mixed reality. These applications often operate on high-resolution images (e.g., 8 megapixels) to capture the fine details. However, this comes at the cost of considerable computational complexity, hindering the deployment in latency-sensitive scenarios. In this paper, we introduce SparseRefine, a novel approach that enhances dense low-resolution predictions with sparse high-resolution refinements. Based on coarse low-resolution outputs, SparseRefine first uses an entropy selector to identify a sparse set of pixels with high entropy. It then employs a sparse feature extractor to efficiently generate the refinements for those pixels of interest. Finally, it leverages a gated ensembler to apply these sparse refinements to the initial coarse predictions. SparseRefine can be seamlessly…
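The three stages named in the abstract (entropy selection, sparse refinement, gated ensembling) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the fixed `threshold`, and the externally supplied `gate` values are assumptions made for illustration, the coarse logits are assumed to be already upsampled to full resolution, and the sparse feature extractor is stood in for by a plain array of refined logits.

```python
import numpy as np

def softmax(logits, axis=0):
    """Numerically stable softmax over the class axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def pixel_entropy(probs, eps=1e-12):
    """Shannon entropy per pixel for probs of shape (C, H, W)."""
    return -(probs * np.log(probs + eps)).sum(axis=0)

def select_high_entropy(probs, threshold):
    """Return an (N, 2) array of (row, col) coordinates whose entropy
    exceeds `threshold` (a hypothetical, hand-picked value here)."""
    return np.argwhere(pixel_entropy(probs) > threshold)

def gated_ensemble(coarse_logits, coords, refined_logits, gate):
    """Blend sparse refinements into the coarse prediction.

    coarse_logits:  (C, H, W) upsampled coarse logits
    coords:         (N, 2) selected pixel coordinates
    refined_logits: (N, C) refinements for the selected pixels
                    (placeholder for the sparse feature extractor output)
    gate:           (N,) blend weights in [0, 1]; assumed given here,
                    whereas a learned gate would predict them
    """
    out = coarse_logits.copy()
    rows, cols = coords[:, 0], coords[:, 1]
    coarse_at = coarse_logits[:, rows, cols].T            # (N, C)
    mixed = gate[:, None] * refined_logits + (1.0 - gate[:, None]) * coarse_at
    out[:, rows, cols] = mixed.T                          # write back only selected pixels
    return out
```

Only the selected pixels are touched in the final step; every other pixel keeps its coarse prediction, which is where the computational saving comes from when the high-entropy set is small.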
Peer Reviews
Decision · Submitted to ICLR 2024
* The idea is simple and makes sense. The area to be refined is indeed sparse, and applying a sparse NN to the refined area makes sense and should improve the time complexity.
* I enjoyed the generality of the method. Because the method does not impose any restrictions on the segmentation architecture and only uses the segmentation logits, it is applicable to any segmentation model. The segmentation model can be plug-and-play.
* The experiments are well conducted. The authors show the genera
* I’m not sure how the training data for the refinement was created. To train the refinement module, sparse high-entropy pixels are required. How are the high-entropy pixels acquired? Are they acquired from the pretrained segmentation architectures? Also, is the refinement model trained for each of the NN architectures in Table 1, or is it universal?
This work explores applying a sparse refinement to the interpolated coarse prediction. It uses an entropy selector to sparsely identify the erroneous regions, so the prediction does not need to be refined at full image size. This reduces computation during inference.
1. I agree that the integration of multiple components into a feasible solution is a non-trivial task. However, the composition of such existing works implies that the proposed work lacks sufficient novelty.
2. The authors claim the proposed work provides a significant speedup in inference. However, a comparison in terms of a more persuasive metric, GFLOPS, is missing, which is independent of the machine speed and commonly used for measuring the inference efficiency of a network model
S1. The proposed method succeeds in improving the inference speed (1.5x - 2.0x) of popular heavy-weight models while keeping the mIoU performance.
S2. Sparse feature extraction appears as a powerful and under-researched computer vision technique.
S3. Simplicity of the method will likely lead to derivative future work.
S4. I was really surprised that looking at sparse pixels with so little context could contribute that much to the final performance.
S5. I was also surprised that showing low res
W1. The three components of the solution (entropy-based uncertainty, Minkowski engine, weighted ensembles) have been proposed in the related work.
W2. Proper validation of hyper-parameter \alpha has not been discussed (validating on test data is not acceptable).
W3. Training the sparse feature extractor requires a lot of computational power (96 RTX A6000 days).
Taxonomy
Topics: Machine Learning and Data Classification · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
Methods: Sparse Evolutionary Training
