MATE: Meet At The Embedding -- Connecting Images with Long Texts
Young Kyun Jang, Junmo Kang, Yong Jae Lee, Donghyun Kim

TL;DR
MATE extends vision-language models (VLMs) to connect images with long texts. It replaces the VLM's text encoder with an LLM-based encoder and bridges the two with a projection module trained in multiple stages, so that longer and more complex text can be matched to images.
Contribution
MATE combines VLMs with LLMs to align images with long texts without requiring additional image-long text pairs, using a projection module that is trained in multiple stages.
Findings
MATE outperforms existing models on new long-text image retrieval benchmarks.
It effectively captures diverse semantic relationships between images and long texts.
Experimental results validate the approach's ability to handle complex textual data.
Abstract
While advancements in Vision Language Models (VLMs) have significantly improved the alignment of visual and textual data, these models primarily focus on aligning images with short descriptive captions. This focus limits their ability to handle complex text interactions, particularly with longer texts such as lengthy captions or documents, which have not been extensively explored yet. In this paper, we introduce Meet At The Embedding (MATE), a novel approach that combines the capabilities of VLMs with Large Language Models (LLMs) to overcome this challenge without the need for additional image-long text pairs. Specifically, we replace the text encoder of the VLM with a pretrained LLM-based encoder that excels in understanding long texts. To bridge the gap between VLM and LLM, MATE incorporates a projection module that is trained in a multi-stage manner. It starts by aligning the…
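The retrieval idea described in the abstract can be illustrated with a minimal sketch: a text embedding from an LLM-based encoder is passed through a projection into the image embedding space, and images are ranked by cosine similarity. Everything here is an assumption for illustration only: the single linear projection, the function names (`project`, `cosine`, `rank_images`), and the toy vectors are hypothetical, and the sketch does not reproduce the paper's actual projection module or its multi-stage training.

```python
import math

def project(vec, weight):
    """Apply a linear map to `vec`; each row of `weight` yields one output dim.
    Assumption: a single linear layer stands in for the projection module."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_images(text_vec, image_vecs, weight):
    """Return image indices sorted from most to least similar to the text."""
    projected = project(text_vec, weight)
    scores = [cosine(projected, img) for img in image_vecs]
    return sorted(range(len(image_vecs)), key=lambda i: scores[i], reverse=True)

# Toy data (hypothetical): a 3-d "LLM" text embedding and 2-d "VLM" image embeddings.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
text_vec = [0.9, 0.1, 0.5]
image_vecs = [[0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]

ranking = rank_images(text_vec, image_vecs, weight)
print(ranking)
```

In a real system the weights would be learned rather than fixed, and the embeddings would come from pretrained encoders; the sketch only shows where the projection sits between the two embedding spaces.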
Taxonomy
Topics: Computability, Logic, AI Algorithms
Methods: Focus · MATE · ALIGN
