LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Mohammad Abuzar Hashemi; Zhanghexuan Li; Mihir Chauhan; Yan Shen; Abhishek Satbhai; Mir Basheer Ali; Mingchen Gao; Sargur Srihari

arXiv:2109.04993·cs.CV·October 2, 2024

TL;DR

LAViTeR is a novel model that enhances visual and textual representations by integrating image synthesis and captioning tasks, leading to better alignment in joint embeddings for vision-language applications.

Contribution

The paper introduces LAViTeR, which combines alignment with auxiliary GAN-based image synthesis and captioning tasks for improved cross-modal representation learning.

Findings

01 Superior alignment on CUB and MS-COCO datasets

02 Effective joint embedding of visual and textual features

03 Enhanced performance in downstream vision-language tasks

Abstract

Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. Transformer-based models learn inter- and intra-modal attention through a set of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and image captioning. We also propose a new evaluation metric that measures the similarity between the learnt visual and textual embeddings. The experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.
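The abstract mentions a metric that measures similarity between learnt visual and textual embeddings but does not define it here. As a rough illustration only, the sketch below scores alignment in a joint embedding space with two generic measures: mean cosine similarity over matched image-caption pairs, and recall@1 for image-to-caption retrieval. Both measures, and the hypothetical function names `cosine_similarity`, `mean_paired_similarity`, and `recall_at_1`, are assumptions made for this example; they are not the metric proposed in the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def mean_paired_similarity(image_embs, text_embs):
    """Average cosine similarity over matched (image, caption) pairs.

    Assumes image_embs[i] and text_embs[i] describe the same sample.
    """
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(cosine_similarity(i, t) for i, t in zip(image_embs, text_embs))
    return total / len(image_embs)


def recall_at_1(image_embs, text_embs):
    """Fraction of images whose most similar caption is their own."""
    if not image_embs or len(image_embs) != len(text_embs):
        raise ValueError("need two non-empty lists of equal length")
    hits = 0
    for idx, img in enumerate(image_embs):
        sims = [cosine_similarity(img, txt) for txt in text_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if best == idx:
            hits += 1
    return hits / len(image_embs)


if __name__ == "__main__":
    # Toy 2-D embeddings standing in for encoder outputs (illustrative values).
    images = [[1.0, 0.1], [0.2, 1.0]]
    captions = [[0.9, 0.0], [0.0, 1.1]]
    print(mean_paired_similarity(images, captions))
    print(recall_at_1(images, captions))
```

Higher values on either measure would indicate that paired images and captions lie closer together in the shared space than mismatched ones. In practice the embeddings would come from the trained image and text encoders, not from hand-written lists.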

Code & Models

Repositories

mshaikh2/MMRL (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques