Contextual Vision Transformers for Robust Representation Learning
Yujia Bao, Theofanis Karaletsos

TL;DR
This paper presents ContextViT, a vision transformer architecture that adds group-specific context tokens and a context inference network to learn image representations that stay robust under distribution shifts.
Contribution
Introduction of ContextViT with context tokens and inference network, enabling improved out-of-distribution generalization and stable feature learning across diverse applications.
Findings
Improved OOD generalization in supervised fine-tuning on iWildCam and FMoW.
Outperforms standard ViT in pathology and microscopy imaging benchmarks.
Effective in self-supervised learning for stable feature extraction.
Abstract
We introduce Contextual Vision Transformers (ContextViT), a method designed to generate robust image representations for datasets experiencing shifts in latent factors across various groups. Derived from the concept of in-context learning, ContextViT incorporates an additional context token to encapsulate group-specific information. This integration allows the model to adjust the image representation in accordance with the group-specific context. Specifically, for a given input image, ContextViT maps images with identical group membership into this context token, which is appended to the input image tokens. Additionally, we introduce a context inference network to predict such tokens on-the-fly, given a batch of samples from the group. This enables ContextViT to adapt to new testing distributions during inference time. We demonstrate the efficacy of ContextViT across a wide range of…
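The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the context inference network is mean pooling over all patch tokens in a group batch followed by a single linear layer; that choice, and the names `mean_pool`, `linear`, `infer_context_token`, and `add_context`, are hypothetical and chosen only for illustration. Plain Python lists stand in for tensors so the example stays self-contained.

```python
from typing import List

Vector = List[float]
TokenSeq = List[Vector]  # one image = a sequence of patch-token vectors


def mean_pool(batch: List[TokenSeq]) -> Vector:
    """Average every patch token of every image in the group batch."""
    tokens = [tok for image in batch for tok in image]
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def linear(weight: List[Vector], bias: Vector, x: Vector) -> Vector:
    """Compute y = W x + b using plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def infer_context_token(batch: List[TokenSeq],
                        weight: List[Vector], bias: Vector) -> Vector:
    # Assumed inference network: mean pooling followed by one linear layer.
    return linear(weight, bias, mean_pool(batch))


def add_context(batch: List[TokenSeq],
                weight: List[Vector], bias: Vector) -> List[TokenSeq]:
    """Append the same inferred context token to every image in the group."""
    ctx = infer_context_token(batch, weight, bias)
    return [image + [ctx] for image in batch]


# Two images from one group, each with two 2-dimensional patch tokens.
group_batch = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder weights, not learned
zero_bias = [0.0, 0.0]

augmented = add_context(group_batch, identity, zero_bias)
```

Because the token is computed from whatever batch is supplied, the same function applies to a group that was never seen during training, which is the property the abstract relies on for test-time adaptation.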
Peer Reviews
Decision · Submitted to ICLR 2024
The idea of capturing context information from the datasets is interesting. The method is described clearly and is easy to follow. The experiments demonstrate the effectiveness of the proposed framework both on data from the same distribution and on datasets with different distributions.
The scope and impact of this paper are limited. The method seems to focus on improving performance on datasets that contain several distinct groups. Although the authors demonstrate improvements on some specific datasets, the improvement on general image tasks is still unclear. The authors should evaluate the framework more widely on other popular datasets and tasks, or extend related techniques to improve transfer learning from one task to another. It should also
The paper presents a method to mitigate the distribution gap between different datasets. Based on their experimental results, the proposed method, ContextViT, has the ability to improve the performance under distribution shift.
- The paper states that the proposed method applies the concept of in-context learning to vision transformers. However, in my opinion, in-context learning is a kind of few-shot learning that predicts from the (data, label) pairs of a few samples, whereas this paper uses all the dataset-c data (or a batch of that data). The method looks more like a summarization of the dataset information followed by a prediction based on that summarization.
- The method requires a lot of distrib
- This paper is well-written and easy to follow.
- Figure 1 is well drawn to illustrate the overall idea of this work.
- Layer-wise context conditioning is well-motivated and makes sense.
The novelty of this work is limited.
- The intrinsic difference between this work and visual prompting [1] is unclear. It seems that visual prompting can also fit this OOD scenario.
- The key idea of this work is similar to [2], which also uses a network to predict the context/domain tokens.
- The comparison in the experiment section is insufficient.
- There is no visualization of the learned context tokens, which would show how they differ across groups.
The paper is simple and effec
Taxonomy
Topics: AI in cancer detection · Cell Image Analysis Techniques · Digital Imaging for Blood Diseases
