Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words

Yujia Bao; Srinivasan Sivanandan; Theofanis Karaletsos

arXiv:2309.16108 · cs.CV · April 22, 2024 · 6 cites


TL;DR

This paper introduces ChannelViT, a modified Vision Transformer architecture designed for multi-channel images, with a new regularization technique, Hierarchical Channel Sampling, to improve robustness and generalization in scenarios with sparse or partial input channels.

Contribution

The paper proposes ChannelViT, which constructs patch tokens per channel with learnable embeddings, and introduces Hierarchical Channel Sampling for enhanced robustness to missing channels during testing.
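
The per-channel tokenization described above can be sketched roughly as follows. This is a minimal NumPy illustration, not the official implementation: the function names, the choice to share the projection and position embedding across channels, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def patchify_per_channel(image, patch):
    """Split a (C, H, W) image into per-channel patches of shape (C, N, patch*patch)."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch
    # (C, gh, patch, gw, patch) -> (C, gh, gw, patch, patch) -> (C, N, patch*patch)
    x = image.reshape(c, gh, patch, gw, patch).transpose(0, 1, 3, 2, 4)
    return x.reshape(c, gh * gw, patch * patch)

def channel_tokens(image, patch, proj, pos_emb, chan_emb):
    """Build one token per (channel, patch) pair.

    proj:     (patch*patch, D) projection, assumed shared across channels
    pos_emb:  (N, D) position embedding, assumed shared across channels
    chan_emb: (C, D) channel embedding, one learnable row per channel
    """
    patches = patchify_per_channel(image, patch)   # (C, N, patch*patch)
    tokens = patches @ proj                        # (C, N, D)
    tokens = tokens + pos_emb[None, :, :] + chan_emb[:, None, :]
    return tokens.reshape(-1, tokens.shape[-1])    # (C*N, D)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
C, H, W, P, D = 5, 32, 32, 16, 8
image = rng.normal(size=(C, H, W))
N = (H // P) * (W // P)
tokens = channel_tokens(
    image, P,
    proj=rng.normal(size=(P * P, D)),
    pos_emb=rng.normal(size=(N, D)),
    chan_emb=rng.normal(size=(C, D)),
)
print(tokens.shape)  # (20, 8): C * N = 5 * 4 tokens, each of width D
```

Each token here covers a 1 x 16 x 16 region of the input, which is the sense of the title; a standard ViT would instead flatten all C channels of a patch into a single token.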

Findings

01 ChannelViT outperforms ViT on ImageNet, JUMP-CP, and So2Sat datasets.

02 HCS improves model robustness regardless of architecture.

03 ChannelViT generalizes well with limited channel access during training.
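
As a rough illustration of Hierarchical Channel Sampling, the sketch below first draws how many channels to keep and then draws which ones. It assumes both stages are uniform; the function name and the example channel count are hypothetical and not taken from the official repository.

```python
import random

def hierarchical_channel_sampling(num_channels, rng=random):
    """Two-stage sampling: pick how many channels to keep, then which ones."""
    # Stage 1 (assumed uniform): number of channels m in 1..num_channels.
    m = rng.randint(1, num_channels)
    # Stage 2 (assumed uniform): m distinct channel indices, without replacement.
    return sorted(rng.sample(range(num_channels), m))

# Draw a few channel subsets for a hypothetical 8-channel image.
rng = random.Random(0)
for _ in range(3):
    print(hierarchical_channel_sampling(8, rng))
```

Under these assumptions every channel count from 1 to C is equally likely during training. Dropping each channel independently with a fixed probability would instead concentrate the count around its mean, so very sparse inputs would be seen less often.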

Abstract

Vision Transformer (ViT) has emerged as a powerful architecture in the realm of modern computer vision. However, its application in certain imaging fields, such as microscopy and satellite imaging, presents unique challenges. In these domains, images often contain multiple channels, each carrying semantically distinct and independent information. Furthermore, the model must demonstrate robustness to sparsity in input channels, as they may not be densely available during training or testing. In this paper, we propose a modification to the ViT architecture that enhances reasoning across the input channels and introduce Hierarchical Channel Sampling (HCS) as an additional regularization technique to ensure robustness when only partial channels are presented during test time. Our proposed model, ChannelViT, constructs patch tokens independently from each input channel and utilizes a…

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

+ The Hierarchical Channel Sampling (HCS) module enhances robustness by performing channel-wise sampling, which proves beneficial in scenarios involving incomplete image channels.
+ ChannelViT surpasses the conventional Vision Transformer (ViT) by demonstrating insensitivity to the number of input image channels, where ViT shows vulnerability.
+ A novel two-stage sampling algorithm is introduced within ChannelViT to selectively obscure input channels, optimizing the model's performance.

Weaknesses

- There is a potential risk of information loss, as highlighted in Section 3 of the methodology. The model's approach to segmenting the input image into various channel sequences and processing them individually could disrupt the alignment of channels, particularly in a 3-channel image prediction task.
- The patch embedding technique described in Section 3.1 overlooks the issue of channel alignment when deconstructing images into separate channels.
- The innovation of the proposed metho…

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

- ChannelViT with HCS proposes a simple but relatively intuitive extension of ViTs, which would have unique applications in multiplexed imaging in which not all "channels" (e.g., fluorescent probes) are made available. Overall, this work presents a method that targets an important application and may enable significant biological / clinical findings in multiplexed imaging.
- The ablation experiments regarding assessment of partial channels, comparisons and baselines with training on single chan…

Weaknesses

- While the main application of ChannelViT is in targeting problems such as those in multiplexed imaging due to the challenge of generalizing encoders across datasets that would have the same set of probes, only one dataset explored in this work is related to multiplexed imaging. The experiments on ImageNet, C17-WILDS, and others are informative and appreciated, but they are not as relevant to the ultimate application that ChannelViT serves. Though C17-WILDS is microscopy, hematoxylin and eosin (H&E) pat…

Reviewer 03 · Rating 8 (accept, good paper) · Confidence 5

Strengths

* Well-motivated paper. The problem of processing multi-channel images is important and the paper presents arguments to design methods that deal with this information effectively.
* The method is simple and effective. Splitting images into a sequence of tokens is an intuitive approach and the results show its effectiveness.
* Beyond tokenizing, an important aspect of the method is sampling channels strategically to learn their associations.
* The analysis in various datasets, including ImageNet, J…

Weaknesses

Major comments:
* Limited baselines reported. The paper only considers ViTs as a baseline, which is a natural comparison given that the proposed method is an extension of this architecture. However, comparisons to other architectures, especially CNNs, for solving multi-channel problems should be reported. This is especially important for appreciating the relative performance compared to other solutions previously tested on multi-channel images.
* Computational cost increases quadratically with the number…
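
The quadratic-cost concern can be illustrated with simple arithmetic. The sketch below assumes one token per (channel, patch) pair and counts query-key pairs in a single self-attention layer; the 224 x 224 input, the 16 x 16 patch, and the channel counts are illustrative choices, not figures from the paper or the review.

```python
def num_tokens(height, width, patch, channels, per_channel):
    """Patch-token count (ignoring any class token) for a patch x patch grid."""
    grid = (height // patch) * (width // patch)
    return grid * channels if per_channel else grid

def attention_pairs(tokens):
    """Number of query-key pairs in one self-attention layer."""
    return tokens * tokens

H = W = 224
P = 16
for C in (3, 8):
    vit = num_tokens(H, W, P, C, per_channel=False)
    cvit = num_tokens(H, W, P, C, per_channel=True)
    ratio = attention_pairs(cvit) // attention_pairs(vit)
    print(C, vit, cvit, ratio)
```

Because the per-channel sequence is C times longer, the pair count grows by a factor of C squared relative to a sequence with one token per patch.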

Code & Models

Repositories

insitro/channelvit (PyTorch, official)

Videos

Channel Vision Transformers: An Image Is Worth 1 x 16 x 16 Words· slideslive

Taxonomy

Topics: Cell Image Analysis Techniques · Image Processing Techniques and Applications · Advanced Fluorescence Microscopy Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Absolute Position Encodings · Dense Connections · Layer Normalization · Byte Pair Encoding