LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation
Mohammad Abuzar Hashemi, Zhanghexuan Li, Mihir Chauhan, Yan Shen, Abhishek Satbhai, Mir Basheer Ali, Mingchen Gao, Sargur Srihari

TL;DR
LAViTeR is a novel model that enhances visual and textual representations by integrating image synthesis and captioning tasks, leading to better alignment in joint embeddings for vision-language applications.
Contribution
The paper introduces LAViTeR, which combines alignment with auxiliary GAN-based image synthesis and captioning tasks for improved cross-modal representation learning.
Findings
Superior alignment on CUB and MS-COCO datasets
Effective joint embedding of visual and textual features
Enhanced performance in downstream vision-language tasks
Abstract
Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. The transformer-based models learn inter- and intra-modal attention through a list of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning. We also propose a new evaluation metric measuring the similarity between the learnt visual and textual embeddings. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.
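As a rough illustration of what measuring similarity between paired visual and textual embeddings can look like, the sketch below scores a toy joint embedding space. It is not the metric proposed in the paper, whose definition is not given here. It assumes cosine similarity as the measure, and the names `cosine_similarity`, `alignment_report`, and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_report(image_embs, text_embs):
    """Hypothetical alignment summary for N paired embeddings.

    Assumes image_embs[k] and text_embs[k] describe the same sample.
    Returns (mean cosine similarity of matched pairs,
             fraction of images whose own caption is the most similar text).
    """
    n = len(image_embs)
    # Full image-by-text similarity matrix.
    sims = [[cosine_similarity(img, txt) for txt in text_embs]
            for img in image_embs]
    mean_matched = sum(sims[k][k] for k in range(n)) / n
    # Count images whose best-scoring text is the matched one.
    hits = sum(
        1 for k in range(n)
        if max(range(n), key=lambda j: sims[k][j]) == k
    )
    return mean_matched, hits / n


# Toy 2-D embeddings standing in for encoder outputs (made-up values).
image_embs = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8]]
mean_matched, recall_at_1 = alignment_report(image_embs, text_embs)
print(mean_matched, recall_at_1)
```

A higher matched-pair similarity together with a higher retrieval fraction would suggest tighter alignment under these assumptions; a real evaluation would use embeddings produced by the trained encoders rather than hand-written vectors.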
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
