ARPA: A Novel Hybrid Model for Advancing Visual Word Disambiguation Using Large Language Models and Transformers
Aristi Papastavrou, Maria Lymperaiou, Giorgos Stamou

TL;DR
ARPA is a hybrid model combining large language models, transformers, and GNNs to improve visual word disambiguation by effectively integrating linguistic and visual data, setting new performance benchmarks.
Contribution
The paper introduces ARPA, a novel hybrid architecture that fuses language models, transformers, and GNNs for enhanced multimodal disambiguation, a significant advancement over existing methods.
Findings
ARPA outperforms previous models in VWSD tasks.
The model demonstrates robustness with complex disambiguation scenarios.
Experimental results show improved accuracy through data augmentation and multimodal training.
Abstract
In the rapidly evolving fields of natural language processing and computer vision, Visual Word Sense Disambiguation (VWSD) stands as a critical, yet challenging task. The quest for models that can seamlessly integrate and interpret multimodal data is more pressing than ever. Imagine a system that can understand language with the depth and nuance of human cognition, while simultaneously interpreting the rich visual context of the world around it. We present ARPA, an architecture that fuses the unparalleled contextual understanding of large language models with the advanced feature extraction capabilities of transformers, which then pass through a custom Graph Neural Network (GNN) layer to learn intricate relationships and subtle nuances within the data. This innovative architecture not only sets a new benchmark in visual word disambiguation but also introduces a versatile framework…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsImage Retrieval and Classification Techniques · Video Analysis and Summarization · Multimodal Machine Learning Applications
MethodsGraph Neural Network
