Anthropogenic Regional Adaptation in Multimodal Vision-Language Model

Samuel Cahyawijaya; Peerat Limkonchotiwat; Tack Hwa Wong; Hitesh Laxmichand Patel; Amit Agarwal; Manuel Antonio Rufino; Carlos Rafael Catalan; Muhammad Reza Qorib; Vicky Feliren; Holy Lovenia; Aye Hninn Khine; Frederikus Hudi; David Anugraha; Alham Fikri Aji; Romrawin Chumpu; Viet-Thanh Pham; Minghan Wang; Mohamed Fazli Imam; Ruochen Zhang; Joseph Marvin Imperial; Khumaisa Nur'aini; Do Xuan Long; Musa Izzanardi Wijanarko; Joel Ruben Antony Moniz; Patrick Amadeus Irawan; Hanif Muhammad Zhafran; Isaiah Flores; Salsabila Zahirah Pranida; Jun Kevin; Jostin Jerico Rosal; Patricia Nicole Monderin; Kun Kerdthaisong; Ahmad Mustafid; My Chiffon Nguyen; Natchapon Jongwiriyanurak; Siva Worajitwannakul; Haochen Li; Adrian Xuan Wei Lim; Bin Wang; Muhammad Ravi Shulthan Habibi; Lynnette Hui Xian Ng; Mithil Bangera; Yeshil Bangera; Priyaranjan Pattnayak; Dun Li Chan; Sherissa Caren Djuniwar; Cho Chan Myei Oo; Hee Ming Shan

arXiv:2604.11490·cs.AI·April 20, 2026

TL;DR

This paper introduces Anthropogenic Regional Adaptation, together with a simple adaptation method called GG-EZ, to improve the regional relevance of vision-language models while maintaining their global performance. The approach is demonstrated through a case study in Southeast Asia.

Contribution

The paper proposes a new paradigm for regional adaptation in vision-language models and a straightforward method, GG-EZ, that optimizes regional relevance without sacrificing global generalization.

Findings

01. GG-EZ achieves 5-15% improvement in regional relevance metrics.

02. Models retain over 98% of their global performance after adaptation.

03. Regional adaptation enhances cultural relevance in Southeast Asia.

Abstract

While vision-language (VL) research has achieved remarkable success in integrating visual and textual information across multiple languages and domains, there is still no dedicated framework for assessing human-centric alignment in vision-language systems. We offer two contributions to address this gap. First, we introduce Anthropogenic Regional Adaptation: a novel paradigm that aims to optimize model relevance to specific regional contexts while retaining global generalization capabilities. Second, we present a simple but effective adaptation method named Geographical-generalization-made-easy (GG-EZ), which utilizes regional data filtering and model merging. Through comprehensive experiments on three VL architectures (large vision-language models, text-to-image diffusion models, and vision-language embedding models) and a case study in Southeast Asia (SEA) regional…
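The abstract names two ingredients of GG-EZ, regional data filtering and model merging, but the excerpt here does not say how either is carried out. The sketch below is only an illustration of those two general ideas, not the authors' implementation. It assumes that samples carry a hypothetical `region` tag, that filtering is a simple tag match, and that merging is plain linear interpolation between a base model's weights and a regionally adapted model's weights with a hypothetical mixing coefficient `alpha`. Weights are represented as plain lists of floats to keep the example self-contained.

```python
from typing import Dict, List, Set

# Hypothetical region tags (ISO 3166-1 alpha-2 codes for Southeast Asian
# countries). The source does not specify the actual filtering criterion.
SEA_REGIONS: Set[str] = {
    "BN", "ID", "KH", "LA", "MM", "MY", "PH", "SG", "TH", "TL", "VN",
}


def filter_regional(samples: List[dict], regions: Set[str] = SEA_REGIONS) -> List[dict]:
    """Keep only samples whose (assumed) 'region' tag is in the target set."""
    return [s for s in samples if s.get("region") in regions]


def merge_weights(
    base: Dict[str, List[float]],
    adapted: Dict[str, List[float]],
    alpha: float = 0.5,
) -> Dict[str, List[float]]:
    """Linearly interpolate two weight dictionaries.

    alpha = 0.0 returns the base weights, alpha = 1.0 the adapted weights.
    Linear interpolation is an assumption made for this sketch; the source
    only says that GG-EZ uses model merging.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if base.keys() != adapted.keys():
        raise ValueError("both models must have the same parameter names")
    merged: Dict[str, List[float]] = {}
    for name in base:
        if len(base[name]) != len(adapted[name]):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            (1.0 - alpha) * b + alpha * a
            for b, a in zip(base[name], adapted[name])
        ]
    return merged


if __name__ == "__main__":
    data = [
        {"id": 1, "region": "ID"},
        {"id": 2, "region": "FR"},
        {"id": 3, "region": "TH"},
    ]
    print([s["id"] for s in filter_regional(data)])

    base_model = {"w": [0.0, 2.0]}
    adapted_model = {"w": [2.0, 4.0]}
    print(merge_weights(base_model, adapted_model, alpha=0.5))
```

In this framing, `alpha` is the knob that trades regional relevance against retained global behavior: smaller values stay closer to the base model, larger values move toward the regionally adapted one.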
