PolyFrame at MWE-2026 AdMIRe 2: When Words Are Not Enough: Multimodal Idiom Disambiguation
Nina Hosseini-Kivanani

TL;DR
This paper presents PolyFrame, a system for multimodal idiom disambiguation that improves performance by using lightweight modules and idiom-aware rewriting, achieving strong results across multiple languages without fine-tuning large encoders.
Contribution
PolyFrame introduces a unified pipeline with lightweight modules for multimodal idiom disambiguation, demonstrating effective performance without fine-tuning large vision-language models.
Findings
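Top-1 accuracy on English rose from the CLIP baseline's 26.7% (English dev) to 60.0% after adding idiom-aware paraphrasing and explicit sentence-type classification.
Zero-shot transfer to Portuguese reached 0.822 NDCG@5.
Idiom-aware rewriting accounts for a large share of the gain in disambiguation accuracy.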
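For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how Top-1 accuracy and NDCG@5 are commonly computed. It assumes linear gains and a log2 position discount; the shared task's exact scoring script is not described here, so the function names and the relevance format are illustrative assumptions rather than the official evaluation code.

```python
import math

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(predicted_order, relevance, k=5):
    """NDCG@k for one item.

    predicted_order: candidate ids, best first (hypothetical format).
    relevance: dict mapping candidate id -> graded relevance (assumed linear gain).
    """
    gains = [relevance.get(c, 0) for c in predicted_order]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(gains, k) / ideal_dcg

def top1_accuracy(predicted_top, gold_top):
    """Fraction of items whose top-ranked candidate matches the gold one."""
    if not gold_top:
        return 0.0
    hits = sum(p == g for p, g in zip(predicted_top, gold_top))
    return hits / len(gold_top)
```

A ranking that places the candidates in exactly the gold order yields an NDCG@5 of 1.0, and every swap of a highly relevant candidate toward the bottom lowers the score.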
Abstract
Multimodal models struggle with idiomatic expressions due to their non-compositional meanings, a challenge amplified in multilingual settings. We introduce PolyFrame, our system for the MWE-2026 AdMIRe 2 shared task on multimodal idiom disambiguation, featuring a unified pipeline for both image+text ranking (Subtask A) and text-only caption ranking (Subtask B). All model variants retain frozen CLIP-style vision-language encoders and the multilingual BGE M3 encoder, training only lightweight modules: a logistic-regression and LLM-based sentence-type predictor, idiom synonym substitution, distractor-aware scoring, and Borda rank fusion. Starting from a CLIP baseline (26.7% Top-1 on English dev, 6.7% on English test), adding idiom-aware paraphrasing and explicit sentence-type classification increased performance to 60.0% Top-1 on English and 60.0% Top-1 (0.822 NDCG@5) in zero-shot…
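The abstract names Borda rank fusion as one of the lightweight modules. As a rough illustration of the general technique (not the authors' implementation), the sketch below combines several rankings of the same candidates by awarding each candidate points according to its position in each ranking. The function name, the candidate ids, and the alphabetical tie-break are assumptions made for this example.

```python
from collections import defaultdict

def borda_fuse(rankings):
    """Fuse several rankings (each a list, best first) with Borda counts.

    A candidate at position p in a ranking of length n earns n - 1 - p points.
    Ties are broken alphabetically by candidate id (an assumption of this sketch).
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical rankings from three scorers over the same three images.
rankings = [
    ["img1", "img2", "img3"],
    ["img2", "img1", "img3"],
    ["img2", "img3", "img1"],
]
fused = borda_fuse(rankings)
```

Because Borda fusion uses only rank positions, it can merge scorers whose raw similarity scores live on different scales, which is one plausible reason to prefer it when combining outputs from frozen encoders.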
Taxonomy
Topics: Natural Language Processing Techniques · Language, Metaphor, and Cognition · Multimodal Machine Learning Applications
