Comprehension of Multilingual Expressions Referring to Target Objects in Visual Inputs

Francisco Nogueira; Alexandre Bernardino; Bruno Martins

arXiv:2511.11427 · cs.CV · November 17, 2025

TL;DR

This paper introduces a large-scale multilingual dataset for referring expression comprehension across 10 languages and proposes an attention-based neural model, demonstrating competitive performance and consistent multilingual capabilities.

Contribution

The work builds a comprehensive multilingual referring expression comprehension (REC) dataset and develops an attention-anchored neural architecture that uses multilingual SigLIP2 encoders for improved visual grounding.

Findings

01

Achieved 86.9% accuracy at IoU@50 on the RefCOCO multilingual benchmark.

02

Constructed a dataset with approximately 8 million expressions across 177,620 images.

03

Model shows consistent performance across multiple languages.
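
The accuracy at IoU@50 metric named in the first finding can be illustrated with a minimal sketch. It counts a prediction as correct when its intersection-over-union with the ground-truth box reaches 0.5. The corner box format, the function names, and the choice of `>=` rather than `>` at the threshold are assumptions made for illustration; the listing does not specify the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) corners, with x2 >= x1 and y2 >= y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Width and height of the overlap are clipped at zero for disjoint boxes.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(
    preds: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Fraction of predictions whose IoU with the target meets the threshold."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(1 for p, t in zip(preds, targets) if iou(p, t) >= threshold)
    return hits / len(preds)
```

With one exact match and one box that overlaps its target only slightly, `accuracy_at_iou` returns 0.5: the first prediction passes the threshold and the second does not.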

Abstract

Referring Expression Comprehension (REC) requires models to localize objects in images based on natural language descriptions. Research in this area remains predominantly English-centric, despite growing demand for global deployment. This work addresses multilingual REC through two main contributions. First, we construct a unified multilingual dataset spanning 10 languages by systematically expanding 12 existing English REC benchmarks through machine translation and context-based translation enhancement. The resulting dataset comprises approximately 8 million multilingual referring expressions across 177,620 images, with 336,882 annotated objects. Second, we introduce an attention-anchored neural architecture that uses multilingual SigLIP2 encoders. Our attention-based approach generates coarse spatial anchors from attention distributions, which are subsequently refined through learned…
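
To make the idea of deriving a coarse spatial anchor from an attention distribution concrete, here is a hypothetical sketch. It is not the paper's method, which the truncated abstract does not fully describe. It assumes a 2D grid of non-negative attention weights over image patches and builds a box from the attention-weighted mean and standard deviation. The function name `coarse_anchor`, the scale factor `k`, and the half-patch minimum size are all illustrative assumptions.

```python
from math import sqrt
from typing import List, Tuple

# Normalized (x1, y1, x2, y2) box with coordinates in [0, 1].
Box = Tuple[float, float, float, float]


def coarse_anchor(attn: List[List[float]], k: float = 2.0) -> Box:
    """Hypothetical anchor: attention-weighted center +/- k standard deviations."""
    if not attn or not attn[0]:
        raise ValueError("attention map must be a non-empty 2D grid")
    h, w = len(attn), len(attn[0])
    total = sum(sum(row) for row in attn)
    if total <= 0:
        raise ValueError("attention map must have positive total mass")

    # First moment: expected patch-center position under the attention weights.
    cx = cy = 0.0
    for i, row in enumerate(attn):
        for j, v in enumerate(row):
            p = v / total
            cx += p * (j + 0.5) / w
            cy += p * (i + 0.5) / h

    # Second central moment: spread of attention around that center.
    vx = vy = 0.0
    for i, row in enumerate(attn):
        for j, v in enumerate(row):
            p = v / total
            vx += p * ((j + 0.5) / w - cx) ** 2
            vy += p * ((i + 0.5) / h - cy) ** 2

    # Assumed floor of half a patch so a single-peak map still yields a box.
    half_w = max(k * sqrt(vx), 0.5 / w)
    half_h = max(k * sqrt(vy), 0.5 / h)

    def clamp(t: float) -> float:
        return min(1.0, max(0.0, t))

    return (clamp(cx - half_w), clamp(cy - half_h),
            clamp(cx + half_w), clamp(cy + half_h))
```

In this sketch, attention concentrated on a single patch yields a box covering that patch, while diffuse attention yields a larger box. A learned refinement stage, as the abstract indicates, would then adjust such a coarse estimate.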

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Visual Attention and Saliency Detection