Win-Win: Training High-Resolution Vision Transformers from Two Windows
Vincent Leroy, Jerome Revaud, Thomas Lucas, Philippe Weinzaepfel

TL;DR
This paper introduces Win-Win, a novel training strategy for high-resolution vision transformers that masks most inputs during training, enabling efficient high-res inference and achieving state-of-the-art results in dense pixelwise tasks.
Contribution
The paper proposes a new window-based masking approach for training high-resolution vision transformers, improving efficiency and performance without complex post-processing.
Findings
4x faster training compared to full-resolution networks
Effective on semantic segmentation, monocular depth, and optical flow tasks
Achieves state-of-the-art results with faster inference
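As a rough illustration of why keeping only a few windows makes training cheaper, the sketch below counts tokens and pairwise self-attention interactions. The patch size (16), the window size (16x16 tokens), and the helper name `attention_pair_ratio` are assumptions chosen for illustration, not values from the paper; the 1280x720 resolution is the one mentioned in the reviews. It does not attempt to reproduce the reported 4x figure, since end-to-end training time depends on much more than attention.

```python
def attention_pair_ratio(kept_tokens: int, total_tokens: int) -> float:
    """Fraction of pairwise attention interactions left when only
    kept_tokens of total_tokens are fed to a global self-attention layer."""
    return (kept_tokens / total_tokens) ** 2


# Hypothetical setting: 1280x720 image, 16x16-pixel patches.
patch = 16
total = (720 // patch) * (1280 // patch)  # tokens in the full image grid

# Hypothetical setting: two non-overlapping windows of 16x16 tokens each.
kept = 2 * 16 * 16

ratio = attention_pair_ratio(kept, total)
print(f"tokens: {kept}/{total}, attention pairs kept: {ratio:.3%}")
```

Under these assumed numbers only a small fraction of the token pairs remains, which is the intuition behind the efficiency claim; the actual speed-up is the one reported by the authors.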
Abstract
Transformers have become the standard in state-of-the-art vision architectures, achieving impressive performance on both image-level and dense pixelwise tasks. However, training vision transformers for high-resolution pixelwise tasks has a prohibitive cost. Typical solutions boil down to hierarchical architectures, fast and approximate attention, or training on low-resolution crops. This latter solution does not constrain architectural choices, but it leads to a clear performance drop when testing at resolutions significantly higher than that used for training, thus requiring ad-hoc and slow post-processing schemes. In this paper, we propose a novel strategy for efficient training and inference of high-resolution vision transformers. The key principle is to mask out most of the high-resolution inputs during training, keeping only N random windows. This allows the model to learn local…
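The masking principle described in the abstract (keep only N random windows of the high-resolution input) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the uniform sampling of window positions, the possibility of overlapping windows, and the grid and window sizes are all assumptions.

```python
import numpy as np


def sample_window_mask(grid_h, grid_w, win_h, win_w, n_windows=2, rng=None):
    """Return a boolean (grid_h, grid_w) mask, True inside n random windows.

    Window positions are drawn uniformly (an assumption); windows may overlap.
    """
    rng = np.random.default_rng() if rng is None else rng
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for _ in range(n_windows):
        top = rng.integers(0, grid_h - win_h + 1)   # upper bound is exclusive
        left = rng.integers(0, grid_w - win_w + 1)
        mask[top:top + win_h, left:left + win_w] = True
    return mask


def keep_window_tokens(tokens, mask):
    """Select the tokens inside the mask.

    tokens: (grid_h * grid_w, dim) array in row-major patch order.
    Returns the kept tokens and their flat indices in the full grid.
    """
    idx = np.flatnonzero(mask.ravel())
    return tokens[idx], idx


# Hypothetical sizes: a 45x80 token grid with 8-dimensional dummy embeddings.
grid_h, grid_w, dim = 45, 80, 8
tokens = np.zeros((grid_h * grid_w, dim), dtype=np.float32)
mask = sample_window_mask(grid_h, grid_w, 16, 16, n_windows=2,
                          rng=np.random.default_rng(0))
kept, idx = keep_window_tokens(tokens, mask)
print(kept.shape, int(mask.sum()))
```

The returned indices would let a model attach the original positional information to the kept tokens, so that at test time the full grid can be processed in a single forward pass.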
Peer Reviews
Decision·ICLR 2024 poster
The proposed method enables simple, single-forward-pass inference on high-resolution images without a performance drop. In contrast, most existing approaches require extra effort to bridge the gap between train and test resolutions, e.g. aggregating predictions from multiple small patches. The ablation study on the window generation strategy is quite extensive, covering various ways of choosing windows, the number of windows, window size, square versus non-square windows, etc. The paper als
While the extensive experiments show that the two-window strategy is the best one, being both simple and well-performing, some analysis of the strategy and of the experimental results is missing. For example, in Table 1 (right), it is not clear why using 1021 tokens leads to a 0.6 drop in performance compared to using 1009. The results in Table 3 indicate that the choice of window strategy strongly affects the optical flow task, which suggests difficulty in generalizing to other data/tasks (a search over window strategies is needed).
Strengths: 1. The proposed model provides competitive performance while reducing training time. 2. The proposed model generalizes to various tasks, such as segmentation and binocular tasks like optical flow. 3. The proposed strategy is easy to apply.
1. The paper presents semantic segmentation results for a single train and test resolution (1280x720). Does the performance hold for resolutions exceeding 1280x720? 2. Win-Win is better than ViT-Det by a mere 0.3%, suggesting a marginal enhancement. Can 0.3% be deemed a substantial improvement?
1. Easy to implement: The method is simple and easy to integrate into the dense prediction tasks like semantic segmentation and binocular tasks like optical flow estimation. 2. Efficient Training: The Win-Win strategy reduces training time by a factor of 4 compared to full-resolution networks. It achieves this by focusing on random windows instead of processing the entire high-resolution input.
1. Lack of Comparative Analysis: The paper does not provide a comprehensive comparison with a wide range of existing methods for training high-resolution vision transformers. 2. Lack of Novelty: Masking out image patches is not a new approach in the literature, and masking out most of them amounts to simple data-augmentation tuning. I highly recommend that the authors conduct a more in-depth exploration and analysis of this method.
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: High-resolution input
