OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation

Jing Liu; Xinxin Zhu; Fei Liu; Longteng Guo; Zijia Zhao; Mingzhen Sun; Weining Wang; Hanqing Lu; Shiyu Zhou; Jiajun Zhang; Jinqiao Wang

arXiv:2107.00249·cs.CV·July 7, 2021·21 cites

Open Access · 2 Repos

TL;DR

The paper introduces OPT, a comprehensive pre-training framework that jointly models visual, textual, and audio data to enhance cross-modal understanding and generation capabilities.

Contribution

It presents an encoder-decoder architecture, pre-trained with a multi-task scheme on large-scale image-text-audio triplet data, for improved multi-modal alignment and translation.

Findings

01 OPT achieves strong multi-modal representations.
02 It performs well on various cross-modal tasks.
03 The multi-task scheme effectively models different data granularities.

Abstract

In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality, a cross-modal encoder to encode the correlations among the three modalities, and two cross-modal decoders to generate text and image respectively. For OPT's pre-training, we design a multi-task pretext learning scheme to model multi-modal resources from three different data granularities, i.e., token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. The pre-training task is carried out on a large amount of image-text-audio triplets from Open Images. Experimental results show that OPT can learn strong image-text-audio…
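
The data flow described in the abstract (three single-modal encoders, one cross-modal encoder, two decoders) can be pictured with a toy sketch. This is not the authors' implementation: the embedding width `DIM`, the scalar-feature inputs, the mean-mixing "encoder", the pooling "decoders", the output lengths, and every function name here are hypothetical stand-ins chosen only to show how the components connect.

```python
import random
from typing import Dict, List

Vector = List[float]
Tokens = List[Vector]  # one embedding vector per token

DIM = 8  # hypothetical embedding width, not a value from the paper
MODALITIES = ("visual", "text", "audio")


def make_encoder(seed: int):
    """Build a toy single-modal encoder with its own fixed random weights."""
    rng = random.Random(seed)
    weights = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]

    def encode(features: List[float]) -> Tokens:
        # Each scalar input feature becomes one DIM-dimensional token embedding.
        return [[x * w for w in weights] for x in features]

    return encode


def cross_modal_encode(streams: Dict[str, Tokens]) -> Tokens:
    """Toy stand-in for a cross-modal encoder (assumption, not the paper's model).

    Concatenates the three token sequences and adds the global mean embedding
    to every token, so each token carries information from all modalities.
    """
    joint = [tok for name in MODALITIES for tok in streams[name]]
    mean = [sum(tok[d] for tok in joint) / len(joint) for d in range(DIM)]
    return [[t + m for t, m in zip(tok, mean)] for tok in joint]


def make_decoder(out_len: int):
    """Build a toy decoder that emits out_len values from the joint tokens."""

    def decode(joint: Tokens) -> Vector:
        # Mean-pool over tokens, then tile the pooled vector to the output length.
        pooled = [sum(tok[d] for tok in joint) / len(joint) for d in range(DIM)]
        return [pooled[i % DIM] for i in range(out_len)]

    return decode


ENCODERS = {name: make_encoder(seed) for seed, name in enumerate(MODALITIES)}
# Two decoders, one per generation target; the output lengths are arbitrary.
DECODERS = {"text": make_decoder(5), "image": make_decoder(16)}


def opt_forward(inputs: Dict[str, List[float]]) -> Dict[str, Vector]:
    """Encode each modality separately, fuse them, then decode both targets."""
    streams = {name: ENCODERS[name](inputs[name]) for name in MODALITIES}
    joint = cross_modal_encode(streams)
    return {target: decode(joint) for target, decode in DECODERS.items()}


outputs = opt_forward({"visual": [0.1, 0.2, 0.3], "text": [1.0, 2.0], "audio": [0.5]})
print({target: len(values) for target, values in outputs.items()})
```

In this sketch each modality keeps its own encoder weights and only the fused token sequence is shared, which mirrors the separation the abstract draws between single-modal and cross-modal components. A real model would presumably replace each toy function with learned network layers.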

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications