ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers

Aristi Papastavrou; Maria Lymperaiou; Giorgos Stamou

arXiv:2408.06040·cs.CV·August 13, 2024

Open Access

TL;DR

ARPA is a hybrid model combining large language models, transformers, and GNNs to improve visual word disambiguation by effectively integrating linguistic and visual data, setting new performance benchmarks.

Contribution

The paper introduces ARPA, a hybrid architecture that fuses large language models, transformers, and a custom GNN layer for multimodal disambiguation.

Findings

01. ARPA outperforms previous models on VWSD tasks.

02. The model remains robust in complex disambiguation scenarios.

03. Data augmentation and multimodal training improve accuracy in the reported experiments.

Abstract

In the rapidly evolving fields of natural language processing and computer vision, Visual Word Sense Disambiguation (VWSD) stands as a critical, yet challenging task. The quest for models that can seamlessly integrate and interpret multimodal data is more pressing than ever. Imagine a system that can understand language with the depth and nuance of human cognition, while simultaneously interpreting the rich visual context of the world around it. We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers, which then pass through a custom Graph Neural Network (GNN) layer to learn intricate relationships and subtle nuances within the data. This innovative architecture not only sets a new benchmark in visual word disambiguation but also introduces a versatile framework…
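
To make the pipeline described in the abstract more concrete, here is a minimal sketch of the general idea: text and image embeddings become nodes in a small graph, one message-passing step mixes information between them, and candidate images are ranked against the text node. Everything specific here is an assumption for illustration, not the authors' design: the star-shaped graph, the mean aggregation, the mixing weight `alpha`, the function names, and the toy 2-dimensional vectors. In a real system the embeddings would come from a language model and a vision transformer, and the graph layer would have learned weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def message_pass(nodes, edges, alpha=0.5):
    """One untrained message-passing step (hypothetical stand-in for a GNN layer).

    Each node keeps (1 - alpha) of its own features and takes alpha of the
    mean of its neighbours' features. Nodes without neighbours are unchanged.
    """
    neighbours = {i: [] for i in range(len(nodes))}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    updated = []
    for i, own in enumerate(nodes):
        if not neighbours[i]:
            updated.append(list(own))
            continue
        group = [nodes[j] for j in neighbours[i]]
        mean = [sum(col) / len(group) for col in zip(*group)]
        updated.append([(1 - alpha) * o + alpha * m for o, m in zip(own, mean)])
    return updated


def rank_candidates(text_vec, image_vecs, alpha=0.5):
    """Return candidate image indices, best match first."""
    nodes = [text_vec] + image_vecs
    # Assumed star graph: the text node (index 0) is linked to every image node.
    edges = [(0, i) for i in range(1, len(nodes))]
    updated = message_pass(nodes, edges, alpha)
    scores = [cosine(updated[0], img) for img in updated[1:]]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (made up): the first image points in nearly the same
# direction as the text vector, the other two do not.
text_embedding = [1.0, 0.0]
image_embeddings = [[0.9, 0.1], [0.0, 1.0], [0.0, -1.0]]
ranking = rank_candidates(text_embedding, image_embeddings)
print(ranking)
```

The sketch only shows how graph-based mixing and similarity ranking fit together; it says nothing about how ARPA's actual layer is parameterised or trained.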

Taxonomy

Topics: Image Retrieval and Classification Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications

Methods: Graph Neural Network