Survey: Transformer-based Models in Data Modality Conversion

Elyas Rashno, Amir Eskandari, Aman Anand, and Farhana Zulkernine

arXiv:2408.04723 · eess.IV · August 12, 2024

TL;DR

This survey reviews transformer-based models for converting data between text, vision, and speech modalities, highlighting their architectures, methods, and applications to demonstrate their versatility in AI content generation.

Contribution

It provides a comprehensive, systematic review of transformer models for modality conversion, filling a gap in existing literature.

Findings

1. Transformers are effective across multiple data modalities.

2. Various architectures and methods have been developed for modality conversion.

3. Transformers significantly enhance AI content understanding and generation.

Abstract

Transformers have made significant strides across various artificial intelligence domains, including natural language processing, computer vision, and audio processing. This success has naturally garnered considerable interest from both academic and industry researchers. Consequently, numerous Transformer variants (often referred to as X-formers) have been developed for these fields. However, a thorough and systematic review of these modality-specific conversions remains lacking. Modality Conversion involves the transformation of data from one form of representation to another, mimicking the way humans integrate and interpret sensory information. This paper provides a comprehensive review of transformer-based models applied to the primary modalities of text, vision, and speech, discussing their architectures, conversion methodologies, and applications. By synthesizing the literature on…

Taxonomy

Methods

Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections