Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words
Yujia Bao, Srinivasan Sivanandan, Theofanis Karaletsos

TL;DR
This paper introduces ChannelViT, a modified Vision Transformer architecture designed for multi-channel images, with a new regularization technique, Hierarchical Channel Sampling, to improve robustness and generalization in scenarios with sparse or partial input channels.
Contribution
The paper proposes ChannelViT, which constructs patch tokens per channel with learnable embeddings, and introduces Hierarchical Channel Sampling for enhanced robustness to missing channels during testing.
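As a rough illustration of the per-channel tokenization described above, the sketch below splits a C x H x W image into 1 x P x P patches, one token per channel per spatial location. It is a minimal sketch, not the authors' implementation: the function name `channel_patch_tokens`, the nested-list image layout, and the tuple format are hypothetical, and the embedding step is only indicated in comments. It assumes the channel index would select a learnable channel embedding and the position index a position embedding.

```python
def channel_patch_tokens(image, patch):
    """Split a C x H x W image (nested lists) into 1 x patch x patch tokens.

    Returns a list of (channel_index, position_index, flat_pixels) tuples.
    Hypothetical helper for illustration only.
    """
    num_channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch size")

    tokens = []
    for c in range(num_channels):
        position = 0  # restarts per channel, so positions line up across channels
        for top in range(0, height, patch):
            for left in range(0, width, patch):
                pixels = [image[c][top + i][left + j]
                          for i in range(patch) for j in range(patch)]
                # Assumption: a model would project `pixels` linearly, then add
                # a position embedding for `position` and a learnable channel
                # embedding for `c`.
                tokens.append((c, position, pixels))
                position += 1
    return tokens
```

For a 3-channel 4 x 4 image with 2 x 2 patches this yields 3 x 4 = 12 tokens, where a standard ViT would produce 4. The sequence grows linearly with the number of channels.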
Findings
ChannelViT outperforms ViT on ImageNet, JUMP-CP, and So2Sat datasets.
HCS improves model robustness regardless of architecture.
ChannelViT generalizes well with limited channel access during training.
Abstract
Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a…
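The channel sampling idea can be sketched as a two-stage draw. This is a minimal sketch under stated assumptions, not the paper's code: it assumes the first stage picks the number of channels to keep uniformly at random and the second stage picks that many distinct channels uniformly. The function name `hierarchical_channel_sample` and its parameters are hypothetical.

```python
import random


def hierarchical_channel_sample(num_channels, rng=random):
    """Return a sorted list of channel indices to keep for one training example.

    Hypothetical helper; the two-stage form below is an assumption.
    """
    # Stage 1: draw how many channels to keep, uniformly from 1..num_channels.
    keep = rng.randint(1, num_channels)
    # Stage 2: draw that many distinct channels uniformly without replacement.
    return sorted(rng.sample(range(num_channels), keep))
```

Under this assumption every channel count from 1 to C is equally likely. Dropping each channel independently with a fixed probability would instead concentrate the kept-channel count around its mean, so the model would rarely see very sparse or fully dense inputs during training.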
Peer Reviews
Decision: ICLR 2024 poster
+ The Hierarchical Channel Sampling (HCS) module enhances robustness by performing channel-wise sampling, which proves beneficial in scenarios involving incomplete image channels.
+ ChannelViT surpasses the conventional Vision Transformer (ViT) by demonstrating insensitivity to the number of input image channels, where ViT shows vulnerability.
+ A novel two-stage sampling algorithm is introduced within ChannelViT to selectively obscure input channels, optimizing the model's performance.
- There is a potential risk of information loss, as highlighted in Section 3 of the methodology. The model's approach to segmenting the input image into various channel sequences and processing them individually could disrupt the alignment of channels, particularly in a 3-channel image prediction task.
- The patch embedding technique described in Section 3.1 overlooks the issue of channel alignment when deconstructing images into separate channels.
- The innovation of the proposed metho
- ChannelViT with HCS proposes a simple but relatively intuitive extension of ViTs, which would have unique applications in multiplexed imaging in which not all "channels" (e.g., fluorescent probes) are made available. Overall, this work presents a method that targets an important application and may enable significant biological / clinical findings in multiplexed imaging.
- The ablation experiments regarding assessment of partial channels, comparisons and baselines with training on single chan
- While the main application of ChannelViT is in targeting problems such as those in multiplexed imaging, due to the challenge of generalizing encoders across datasets that would have the same set of probes, only one dataset explored in this work is related to multiplexed imaging. Experiments on ImageNet, C17-WILDS, and others are informative and appreciated, but are not as relevant to the ultimate application that ChannelViT serves. Though C17-WILDS is microscopy, hematoxylin and eosin (H&E) pat
* Well-motivated paper. The problem of processing multi-channel images is important and the paper presents arguments to design methods that deal with this information effectively.
* The method is simple and effective. Splitting images into a sequence of tokens is an intuitive approach and the results show its effectiveness.
* Beyond tokenizing, an important aspect of the method is sampling channels strategically to learn their associations.
* The analysis in various datasets, including ImageNet, J
Major comments:
* Limited baselines reported. The paper only considers ViTs as a baseline, which is a natural comparison given that the proposed method is an extension of this architecture. However, comparisons to other architectures, especially CNNs, on multi-channel problems should be reported. This is especially important for appreciating the relative performance compared to other solutions previously tested on multi-channel images.
* Computational cost increases quadratically with the number
Taxonomy
Topics: Cell Image Analysis Techniques · Image Processing Techniques and Applications · Advanced Fluorescence Microscopy Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Byte Pair Encoding
