TL;DR
This paper presents a novel approach combining generative LLMs and multilingual CLIP models to improve the ranking of images based on idiomatic nominal compounds in English and Portuguese, demonstrating enhanced multimodal representations.
Contribution
The work introduces a new multimodal method leveraging LLM-generated idiomatic meanings and CLIP embeddings, with contrastive learning and data augmentation for better image ranking.
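The contrastive learning step mentioned above can be illustrated with a CLIP-style symmetric loss. This is a minimal sketch, not the paper's confirmed objective: the function name `clip_style_loss`, the temperature value, and the toy vectors are all assumptions made for illustration, and NumPy arrays stand in for real CLIP embeddings.

```python
import numpy as np

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def clip_style_loss(text_embs, image_embs, temperature=0.07):
    # Hypothetical helper: symmetric cross-entropy over a text-image
    # similarity matrix, where row i of each input forms a matching pair.
    t = np.asarray(text_embs, dtype=float)
    v = np.asarray(image_embs, dtype=float)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # cosine similarities, scaled
    idx = np.arange(len(t))
    loss_t2i = -log_softmax(logits, axis=1)[idx, idx].mean()  # text -> image
    loss_i2t = -log_softmax(logits, axis=0)[idx, idx].mean()  # image -> text
    return 0.5 * (loss_t2i + loss_i2t)

# Toy data (assumption): three orthogonal "meaning" vectors.
texts = np.eye(3)
aligned = clip_style_loss(texts, np.eye(3))                    # correct pairs
shuffled = clip_style_loss(texts, np.roll(np.eye(3), 1, axis=0))  # wrong pairs
print(aligned < shuffled)
```

The loss is small when each text embedding is closest to its own image and large when the pairs are mismatched, which is the signal a fine-tuning procedure of this kind would optimize.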
Findings
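Representations built from LLM-generated idiomatic meanings outperform those based solely on the original nominal compounds.
Fine-tuning with contrastive learning and data augmentation shows promise but yields less improvement than using the embeddings directly.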
The approach effectively captures idiomatic meanings across languages.
Abstract
SemEval-2025 Task 1 focuses on ranking images based on their alignment with a given nominal compound that may carry idiomatic meaning in both English and Brazilian Portuguese. To address this challenge, this work uses generative large language models (LLMs) and multilingual CLIP models to enhance idiomatic compound representations. LLMs generate idiomatic meanings for potentially idiomatic compounds, enriching their semantic interpretation. These meanings are then encoded using multilingual CLIP models, serving as representations for image ranking. Contrastive learning and data augmentation techniques are applied to fine-tune these embeddings for improved performance. Experimental results show that multimodal representations extracted through this method outperformed those based solely on the original nominal compounds. The fine-tuning approach shows promising outcomes but is less…
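The ranking step described in the abstract can be sketched as a cosine-similarity sort. This is an illustrative sketch under stated assumptions: `rank_images` is a hypothetical helper, and the short toy vectors stand in for embeddings that, in the described pipeline, would come from a multilingual CLIP text encoder (applied to the LLM-generated meaning) and a CLIP image encoder.

```python
import numpy as np

def rank_images(text_emb, image_embs):
    """Return image indices from most to least similar to the text, plus scores."""
    text = np.asarray(text_emb, dtype=float)
    images = np.asarray(image_embs, dtype=float)
    # Normalize so the dot product equals cosine similarity.
    text = text / np.linalg.norm(text)
    images = images / np.linalg.norm(images, axis=1, keepdims=True)
    scores = images @ text
    # Negate scores to sort in descending order; stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")
    return order.tolist(), scores

# Toy stand-ins (assumption): one text embedding and three candidate images.
text_emb = [1.0, 0.0, 0.0]
image_embs = [
    [0.0, 1.0, 0.0],   # unrelated image
    [0.9, 0.1, 0.0],   # close match
    [0.5, 0.5, 0.0],   # partial match
]
order, scores = rank_images(text_emb, image_embs)
print(order)  # → [1, 2, 0]
```

Swapping the compound's own text for the generated idiomatic meaning changes only `text_emb`; the ranking logic stays the same, which is why the two representations can be compared directly.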
Taxonomy
Methods: Contrastive Learning · Contrastive Language-Image Pre-training
