Survey: Transformer-based Models in Data Modality Conversion
Elyas Rashno, Amir Eskandari, Aman Anand, and Farhana Zulkernine

TL;DR
This survey reviews transformer-based models for converting data between text, vision, and speech modalities, highlighting their architectures, methods, and applications to demonstrate their versatility in AI content generation.
Contribution
It provides a comprehensive, systematic review of transformer models for modality conversion, filling a gap in existing literature.
Findings
Transformers are effective across the text, vision, and speech modalities.
Various architectures and methods have been developed for modality conversion.
Transformers significantly enhance AI content understanding and generation.
Abstract
Transformers have made significant strides across various artificial intelligence domains, including natural language processing, computer vision, and audio processing. This success has naturally garnered considerable interest from both academic and industry researchers. Consequently, numerous Transformer variants (often referred to as X-formers) have been developed for these fields. However, a thorough and systematic review of these modality-specific conversions remains lacking. Modality Conversion involves the transformation of data from one form of representation to another, mimicking the way humans integrate and interpret sensory information. This paper provides a comprehensive review of transformer-based models applied to the primary modalities of text, vision, and speech, discussing their architectures, conversion methodologies, and applications. By synthesizing the literature on…
Taxonomy
Methods: Attention Is All You Need · Linear Layer · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Softmax · Absolute Position Encodings · Dense Connections
