Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment

Hao Liu; Wilson Yan; Pieter Abbeel

arXiv:2302.00902·cs.LG·February 6, 2023·5 cites

TL;DR

This paper introduces Language-Quantized AutoEncoder (LQAE), an unsupervised method that aligns images with text representations using pretrained language models, enabling multimodal tasks without requiring aligned image-text datasets.

Contribution

LQAE is presented as the first approach to use unaligned images for multimodal tasks, which it does by quantizing image embeddings with the help of pretrained language models.

Findings

01

Enables few-shot image classification with large language models.

02

Enables linear classification of images from BERT text features.

03

Aligns images and text without requiring paired datasets.

Abstract

Recent progress in scaling up large language models has shown impressive capabilities in performing few-shot learning across a wide range of text-based tasks. However, a key limitation is that these language models fundamentally lack visual perception, a crucial attribute needed to extend these models to interact with the real world and solve vision tasks, such as visual question answering and robotics. Prior works have largely connected images to text through pretraining and/or fine-tuning on curated image-text datasets, which can be a costly process. To resolve this limitation, we propose a simple yet effective approach called Language-Quantized AutoEncoder (LQAE), a modification of VQ-VAE that learns to align text-image data in an unsupervised manner by leveraging pretrained language models (e.g., BERT, RoBERTa). Our main idea is to encode image…
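The abstract describes LQAE as a modification of VQ-VAE that draws on a pretrained language model. As a rough illustration of the quantization step such a model relies on, the JAX sketch below snaps each image embedding to its nearest entry in a frozen codebook. Everything specific here is an assumption made for illustration, not the authors' implementation: the function name `quantize_to_codebook` is hypothetical, the codebook is random data standing in for frozen token embeddings, and the squared Euclidean distance and straight-through gradient follow common VQ-VAE practice.

```python
import jax
import jax.numpy as jnp


def quantize_to_codebook(z, codebook):
    """Map each row of z (n, d) to its nearest row of codebook (k, d).

    Hypothetical helper: nearest-neighbour lookup in the usual VQ-VAE style.
    """
    # Squared Euclidean distance between every embedding and every code: (n, k)
    d2 = (
        jnp.sum(z ** 2, axis=1, keepdims=True)
        - 2.0 * z @ codebook.T
        + jnp.sum(codebook ** 2, axis=1)[None, :]
    )
    ids = jnp.argmin(d2, axis=1)  # index of the closest code per embedding
    z_q = codebook[ids]           # the quantized embeddings
    # Straight-through estimator: the forward value is z_q, while gradients
    # pass to z as if quantization were the identity. The codebook receives
    # no gradient here, i.e. it is treated as frozen.
    z_st = z + jax.lax.stop_gradient(z_q - z)
    return ids, z_st


# Assumption: random arrays stand in for real model outputs.
key_z, key_c = jax.random.split(jax.random.PRNGKey(0))
codebook = jax.random.normal(key_c, (50, 8))  # stand-in for token embeddings
z = jax.random.normal(key_z, (4, 8))          # stand-in for image embeddings
ids, z_st = quantize_to_codebook(z, codebook)
```

In this sketch `ids` plays the role of a sequence of token indices, which is the form a text-only language model can consume. How LQAE actually trains the encoder and decoder around this step is not covered by the truncated abstract above.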

Code & Models

Repositories

lhao499/language-quantized-autoencoders
jax

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling

Methods: Multi-Head Attention · Attention Is All You Need · Residual Connection · Dense Connections · Dropout · Softmax · Layer Normalization · ALIGN · Linear Warmup With Linear Decay