Visual Concepts Tokenization

Tao Yang; Yuwang Wang; Yan Lu; Nanning Zheng

arXiv:2205.10093 · cs.CV · October 14, 2022 · 6 cites

TL;DR

This paper introduces VCT, an unsupervised transformer framework that tokenizes images into disentangled visual concepts, improving scene understanding and representation learning.

Contribution

VCT is the first unsupervised transformer-based method to produce independent visual concept tokens using cross-attention and a novel disentangling loss.

Findings

01 VCT achieves state-of-the-art results on multiple datasets.

02 VCT effectively disentangles visual concepts without supervision.

03 VCT improves scene decomposition accuracy.

Abstract

Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. Particularly, to obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to facilitate that different concept tokens represent independent visual concepts. The…
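The cross-attention-only design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the residual update, the fixed image tokens, the random weights, and every name (`cross_attention_layer`, `tokenize`, `Wq`, `Wk`, `Wv`) are assumptions made for illustration, and layer normalization and feed-forward blocks are omitted. The sketch shows only the property the abstract emphasizes: because concept tokens attend to image tokens and never to one another, changing one concept token leaves the others unaffected.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_layer(concepts, image_tokens, Wq, Wk, Wv):
    """One cross-attention update (hypothetical, single head).

    concepts:     (K, d) concept tokens, used as queries
    image_tokens: (N, d) image tokens, used as keys and values
    """
    q = concepts @ Wq                          # (K, d)
    k = image_tokens @ Wk                      # (N, d)
    v = image_tokens @ Wv                      # (N, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (K, N)
    attn = softmax(scores, axis=-1)            # each concept's weights over image tokens
    # Residual update: row i depends only on concept i and the image tokens.
    return concepts + attn @ v

def tokenize(concepts, image_tokens, layers):
    # Apply the layers in sequence; there is no self-attention between concepts.
    for Wq, Wk, Wv in layers:
        concepts = cross_attention_layer(concepts, image_tokens, Wq, Wk, Wv)
    return concepts

rng = np.random.default_rng(0)
K, N, d, num_layers = 4, 16, 8, 3              # illustrative sizes, not from the paper
layers = [tuple(rng.normal(scale=0.3, size=(d, d)) for _ in range(3))
          for _ in range(num_layers)]
image_tokens = rng.normal(size=(N, d))
concepts = rng.normal(size=(K, d))

out = tokenize(concepts, image_tokens, layers)

# Perturb only concept token 0 and run the same layers again.
perturbed = concepts.copy()
perturbed[0] += 1.0
out_perturbed = tokenize(perturbed, image_tokens, layers)
```

Comparing `out[1:]` with `out_perturbed[1:]` shows that the untouched concept tokens produce the same outputs, while `out[0]` changes. Adding a self-attention step over the concept tokens would break this row-wise independence, which is the leakage the abstract says the design avoids.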

Code & Models

Videos

Visual Concepts Tokenization · SlidesLive

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Cell Image Analysis Techniques · Multimodal Machine Learning Applications