Masked AutoDecoder is Effective Multi-Task Vision Generalist
Han Qiu, Jiaxing Huang, Peng Gao, Lewei Lu, Xiaoqin Zhang, Shijian Lu

TL;DR
Masked AutoDecoder (MAD) introduces a parallel, bi-directional attention-based framework for multi-task vision modeling, achieving superior efficiency and competitive accuracy across various tasks.
Contribution
MAD presents a novel parallel decoding approach with masked sequence modeling for unified multi-task vision learning, reducing task-specific design complexity.
Findings
Outperforms autoregressive models in efficiency and accuracy
Handles multiple vision tasks with a single network branch
Achieves competitive results with minimal task-specific modifications
Abstract
Inspired by the success of general-purpose models in NLP, recent studies attempt to unify different vision tasks in the same sequence format and employ autoregressive Transformers for sequence prediction. They apply uni-directional attention to capture sequential dependencies and generate task sequences recursively. However, such autoregressive Transformers may not fit vision tasks well, as vision task sequences usually lack the sequential dependencies typically observed in natural languages. In this work, we design Masked AutoDecoder (MAD), an effective multi-task vision generalist. MAD consists of two core designs. First, we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies comprehensively and decode vision task sequences in parallel. Second, we design a masked sequence modeling approach that learns rich task contexts by…
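The contrast the abstract draws between uni-directional and bi-directional attention, and the idea of masking part of a task sequence during training, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `<mask>` token, the masking ratio, and the toy sequence are all hypothetical and chosen only for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder token, not taken from the paper


def causal_mask(n):
    """Uni-directional attention: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bi-directional attention: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of tokens with MASK.

    Returns the corrupted sequence and the sorted indices that were masked,
    which a model would be trained to reconstruct.
    """
    k = round(len(tokens) * ratio)
    idx = sorted(rng.sample(range(len(tokens)), k))
    corrupted = list(tokens)
    for i in idx:
        corrupted[i] = MASK
    return corrupted, idx


if __name__ == "__main__":
    # Toy detection-style sequence: four box coordinates followed by a class label.
    seq = ["12", "40", "88", "97", "dog", "5", "9", "60", "70", "cat"]
    corrupted, idx = mask_tokens(seq, ratio=0.5, rng=random.Random(0))
    print(corrupted, idx)
    # With a causal mask, each token sees only earlier tokens, so decoding
    # proceeds one token at a time. With a bi-directional mask, all masked
    # positions can be predicted in a single forward pass.
    print(causal_mask(3))
    print(bidirectional_mask(3))
```

In this sketch the causal mask forces token-by-token generation, while the all-true mask lets every masked position draw on context from both sides, which is what makes decoding in parallel possible.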
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. The idea of combining a non-autoregressive method with the pix2seq framework is very original and appropriate. 2. The experimental results are very solid, and the superior performance and inference efficiency of the MAD model over autoregressive models are demonstrated through various vision tasks such as object detection, instance segmentation, keypoint detection, and image captioning. 3. The paper is well-written and the content is well-organized. The concepts and methodology are discussed i
1. Lack of Comparison in Inference Efficiency: The paper could benefit from a more comprehensive comparison of the inference efficiency of the proposed MAD model with other methods. While Table 1 provides some comparison, it would be beneficial to see a more detailed analysis, including various models and methods. 2. Limited Ablation Studies on Inference Strategies: The paper could further improve by providing more ablation studies on the inference strategies. Understanding how different aspect
1) It is quite an interesting discovery that such a small modification can lead to such an enhancement. 3) Easy to understand.
1) The writing of this paper is not very clear. In the method part, the authors put a lot of effort into introducing the overlap part with Pix2Seq V2, such as the tokenizer and masked modeling. However, the difference from Pix2Seq V2 is not well presented. Given that the ICLR has a quite tight page limit, I am quite astonished that the appendix is very short (I would have thought I could find more details in the appendix). As a result, I am still not very clear on how to achieve the change from
* The paper focuses on a practical problem. * The method is simple and shown to improve the performance. * The method is shown to be fast.
* The proposed method is closely related and similar to Pix2SeqV2 [A]. More specifically, the method can be viewed as introducing a MAD to [A]. Though in Table 1 the proposed method obtains better results than Pix2SeqV2, the proposed method uses a different backbone, making it difficult to measure the effectiveness and efficiency of the proposed method. As the authors follow the same setup as Pix2SeqV2, why not build the proposed method based on Pix2SeqV2? Also, though the ablation study is give
+ The direction of building a generalist model for computer vision is interesting, and in my belief a next frontier to explore on the evaluation side following the trajectory of NLP. + I believe the idea in the paper works -- meaning that masked autoencoding on the latent feature space should still work, and should help the final performance on multiple tasks.
- Following the second point of strengths -- unfortunately I think it can be viewed as a weakness of the paper too. Since the common belief is that masked autoencoding should work, it is not presenting new knowledge to me even if it works. I do think the exploration has its value (the value to verify it works), but the value is not too significant, and as time goes by (after people try on many other domains) the value diminishes. - The writing/presentation is okay but not great. I can follow the
Taxonomy
Topics: Visual perception and processing mechanisms
