Contextual Vision Transformers for Robust Representation Learning

Yujia Bao; Theofanis Karaletsos

arXiv:2305.19402·cs.CV·October 2, 2023·2 cites

TL;DR

This paper presents ContextViT, a vision transformer architecture that adds a group-specific context token, together with a context inference network that predicts it, to keep image representations robust under distribution shifts.

Contribution

Introduces ContextViT, which pairs group-specific context tokens with a context inference network, improving out-of-distribution generalization and yielding more stable features in both supervised fine-tuning and self-supervised learning.

Findings

01. Improved OOD generalization in supervised fine-tuning on iWildCam and FMoW.

02. Outperforms standard ViT in pathology and microscopy imaging benchmarks.

03. Effective in self-supervised learning for stable feature extraction.

Abstract

We introduce Contextual Vision Transformers (ContextViT), a method designed to generate robust image representations for datasets experiencing shifts in latent factors across various groups. Derived from the concept of in-context learning, ContextViT incorporates an additional context token to encapsulate group-specific information. This integration allows the model to adjust the image representation in accordance with the group-specific context. Specifically, for a given input image, ContextViT maps images with identical group membership into this context token, which is appended to the input image tokens. Additionally, we introduce a context inference network to predict such tokens on-the-fly, given a batch of samples from the group. This enables ContextViT to adapt to new testing distributions during inference time. We demonstrate the efficacy of ContextViT across a wide range of…
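The mechanism described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the context inference network is simply a mean over all patch embeddings in a group batch followed by a linear projection, it uses plain Python lists in place of tensors, and it omits the transformer layers entirely. The names `infer_context_token`, `add_context`, `mean_pool`, and `linear` are hypothetical and chosen only for this illustration.

```python
import random


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def linear(weight, vec):
    """Apply a (d_out x d_in) weight matrix, given as a list of rows, to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def infer_context_token(group_batch, weight):
    """Assumed context inference: pool every patch embedding in the
    group batch into one vector, then project it with a linear map."""
    all_patches = [patch for image in group_batch for patch in image]
    return linear(weight, mean_pool(all_patches))


def add_context(image_tokens, context_token):
    """Return a new token sequence with the context token appended."""
    return image_tokens + [context_token]


random.seed(0)
dim, num_patches, batch_size = 4, 3, 5

# Toy patch embeddings for a batch of images drawn from one group.
group_batch = [
    [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_patches)]
    for _ in range(batch_size)
]

# Identity projection keeps the toy example easy to check by hand.
weight = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

context = infer_context_token(group_batch, weight)
sequences = [add_context(image, context) for image in group_batch]
```

Every image in the batch receives the same context token because the token is computed from the group rather than from the individual image. Since it is inferred from whatever batch is supplied, a batch from a previously unseen group would yield its own token without any retraining, which is the test-time adaptation the abstract refers to.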

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

The idea of capturing context information from the datasets is interesting. The writing of this method is clear and easy to follow. The experiments demonstrate the efficiency of their proposed framework on both the dataset with the same distribution and other datasets with different distributions.

Weaknesses

The view and impact of this paper are limited. It seems the method focuses on improving the performance of the datasets that contain several distinct groups. Although the authors demonstrate improvements on some specific datasets, the improvement in general image tasks is still unclear. It is suggested to widely evaluate their framework on other popular datasets and tasks or extend related techniques to improve the capability of transfer learning from one task to some other tasks. It should also

Reviewer 02 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

The paper presents a method to mitigate the distribution gap between different datasets. Based on their experimental results, the proposed method, ContextViT, has the ability to improve the performance under distribution shift.

Weaknesses

- The paper mentioned that the proposed method applies the concept of in-context learning in vision transformer. However, in my opinion, in-context learning is a kind of few-shot learning, which predicts based on the (data, label) pair of a few samples, unlike the usage of all the dataset-c data (or a batch of the data) in this paper. The method looks like a summarization of the dataset information and then makes the prediction based on that summarization.
- The method requires a lot of distrib

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 4

Strengths

- This paper is well-written and easy to follow.
- Figure 1 is well drawn to illustrate the overall idea of this work.
- Layer-wise context conditioning is well-motivated and makes sense.

Weaknesses

The novelty of this work is limited.
- The intrinsic difference between this work and visual prompting [1] is unclear. It seems that visual prompting can also fit this OOD scenario.
- The key idea of this work is similar to [2], which also uses a network to predict the context/domain tokens.
- The comparison in experiment section is insufficient.
- Lack of visualization of the learned context token, which shows the difference of context tokens of different groups.

The paper is simple and effec

Code & Models

Repositories

insitro/contextvit
PyTorch · Official

Taxonomy

Topics: AI in cancer detection · Cell Image Analysis Techniques · Digital Imaging for Blood Diseases