Image-Specific Adaptation of Transformer Encoders for Compute-Efficient Segmentation
Manyi Yao, Abhishek Aich, Yumin Suh, Amit Roy-Chowdhury, Christian Shelton, Manmohan Chandraker

TL;DR
This paper introduces ECO-M2F, a method for adaptively selecting the number of encoder layers in vision transformer models for image segmentation, reducing computational costs while maintaining accuracy.
Contribution
ECO-M2F enables self-adaptive encoder depth in transformer models, balancing performance and efficiency through a three-step training process.
Findings
Reduces encoder computational cost without sacrificing segmentation accuracy.
Adapts to different user compute resources effectively.
Extensible beyond segmentation to object detection.
Abstract
Vision transformer-based models bring significant improvements for image segmentation tasks. Although these architectures offer powerful capabilities irrespective of specific segmentation tasks, their use of computational resources can be taxing on deployed devices. One way to overcome this challenge is to adapt the computation level to the specific needs of the input image rather than follow the current one-size-fits-all approach. To this end, we introduce ECO-M2F, or EffiCient TransfOrmer Encoders for Mask2Former-style models. Noting that the encoder module of M2F-style models incurs highly resource-intensive computations, ECO-M2F provides a strategy to self-select the number of hidden layers in the encoder, conditioned on the input image. To enable this self-selection ability and balance performance against computational efficiency, we present a three-step recipe. The first…
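The core idea in the abstract, choosing how many encoder layers to run for each input image, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `make_layer`, `select_depth`, `adaptive_encode`, and the `budget` parameter are hypothetical, the "layers" are toy functions on a list of floats, and the variance-based gate is a hand-written stand-in for whatever learned selection mechanism ECO-M2F actually trains.

```python
from typing import Callable, List, Tuple

Layer = Callable[[List[float]], List[float]]


def make_layer(scale: float) -> Layer:
    """Toy stand-in for one encoder layer: pulls each value toward the mean."""
    def layer(x: List[float]) -> List[float]:
        mean = sum(x) / len(x)
        return [v + scale * (mean - v) for v in x]
    return layer


def select_depth(features: List[float], max_depth: int, budget: float = 1.0) -> int:
    """Hypothetical gate: map a cheap image statistic to a layer count.

    Assumption: variance is used as a proxy for image complexity. A trained
    gating module would replace this heuristic in a real system.
    """
    mean = sum(features) / len(features)
    variance = sum((v - mean) ** 2 for v in features) / len(features)
    score = min(1.0, variance)                 # clamp complexity score to [0, 1]
    depth = round(score * budget * max_depth)  # scale by the user's compute budget
    return max(1, min(max_depth, depth))       # always run at least one layer


def adaptive_encode(
    features: List[float], layers: List[Layer], budget: float = 1.0
) -> Tuple[List[float], int]:
    """Run only the first `depth` layers, where `depth` depends on the input."""
    depth = select_depth(features, len(layers), budget)
    x = list(features)
    for layer in layers[:depth]:
        x = layer(x)
    return x, depth


layers = [make_layer(0.5) for _ in range(6)]

flat_image = [0.5, 0.5, 0.5, 0.5]      # no variation: the gate picks a shallow encoder
textured_image = [0.0, 2.0, 0.0, 2.0]  # high variation: the gate picks the full encoder

_, depth_flat = adaptive_encode(flat_image, layers)
_, depth_textured = adaptive_encode(textured_image, layers)
_, depth_half_budget = adaptive_encode(textured_image, layers, budget=0.5)
print(depth_flat, depth_textured, depth_half_budget)
```

In this sketch the `budget` argument shows how a single gate could be rescaled for devices with less compute, which is one plausible reading of the "different user compute resources" finding. How the paper trains the gate and encoder is described in its three-step recipe, which this sketch does not attempt to reproduce.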
