Visual Concepts Tokenization
Tao Yang, Yuwang Wang, Yan Lu, Nanning Zheng

TL;DR
This paper introduces VCT, an unsupervised transformer framework that tokenizes images into disentangled visual concepts, improving scene understanding and representation learning.
Contribution
VCT is the first unsupervised transformer-based method to produce independent visual concept tokens, using cross-attention alone (no self-attention between concept tokens) and a new Concept Disentangling Loss.
Findings
VCT achieves state-of-the-art results on multiple datasets.
VCT effectively disentangles visual concepts without supervision.
VCT improves scene decomposition accuracy.
Abstract
Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image into a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. Particularly, to obtain these concept tokens, we only use cross-attention to extract visual information from the image tokens layer by layer without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to facilitate that different concept tokens represent independent visual concepts. The…
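The cross-attention-only design described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the residual update, the fixed image tokens, the random weights, and every name (`cross_attention_layer`, `tokenize_concepts`, the dimensions) are assumptions made for illustration. It shows only the structural point that, when concept tokens act as queries and image tokens supply keys and values, each concept token is updated without reading any other concept token.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_layer(concepts, image_tokens, Wq, Wk, Wv):
    """One hypothetical layer: concept tokens query the image tokens.

    concepts:     (M, d) array, one row per concept token
    image_tokens: (N, d) array, one row per image token
    """
    q = concepts @ Wq          # queries come only from concept tokens
    k = image_tokens @ Wk      # keys come only from image tokens
    v = image_tokens @ Wv      # values come only from image tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (M, N)
    # Residual update; there is no self-attention across the M concept rows.
    return concepts + attn @ v


def tokenize_concepts(concepts, image_tokens, layers):
    # Apply the cross-attention layers one after another ("layer by layer").
    for Wq, Wk, Wv in layers:
        concepts = cross_attention_layer(concepts, image_tokens, Wq, Wk, Wv)
    return concepts


rng = np.random.default_rng(0)
d, M, N, num_layers = 8, 4, 16, 3   # illustrative sizes, not from the paper
layers = [
    tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    for _ in range(num_layers)
]
image_tokens = rng.normal(size=(N, d))
concepts = rng.normal(size=(M, d))

out = tokenize_concepts(concepts, image_tokens, layers)

# Perturb only the first concept token and run the same layers again.
perturbed = concepts.copy()
perturbed[0] += 1.0
out_perturbed = tokenize_concepts(perturbed, image_tokens, layers)
```

In this sketch, `out[1:]` and `out_perturbed[1:]` match: changing one concept token leaves the others unaffected, which is the "no information leakage across concept tokens" property the abstract attributes to omitting self-attention.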
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Cell Image Analysis Techniques · Multimodal Machine Learning Applications
