ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla

Deeparghya Dutta Barua; Md Sakib Ul Rahman Sourove; Md Fahim; Fabiha Haider; Fariha Tanjim Shifat; Md Tasmim Rahman Adib; Anam Borhan Uddin; Md Farhan Ishmam; Md Farhad Alam

arXiv:2410.14991·cs.CV·June 3, 2025

TL;DR

This paper introduces ChitroJera, a large-scale, regionally relevant Bangla VQA dataset, and benchmarks several model families on it. Pre-trained dual-encoders outperform other models of similar scale, and large vision-language models achieve the best performance under prompt-based evaluation in this low-resource language setting.

Contribution

The paper presents the first large-scale Bangla VQA dataset with regional relevance and benchmarks multiple models, including novel dual-encoder architectures.

Findings

1. Pre-trained dual-encoders outperform other models of similar scale.
2. Large vision-language models achieve the best performance with prompt-based evaluation.
3. ChitroJera expands the scope of vision-language tasks in Bangla.

Abstract

Visual Question Answering (VQA) poses the problem of answering a natural language question about a visual context. Bangla, despite being a widely spoken language, is considered low-resource in the realm of VQA because it lacks proper benchmarks, which challenges models known to perform well in other languages. Furthermore, existing Bangla VQA datasets offer little regional relevance and are largely adapted from their foreign counterparts. To address these challenges, we introduce a large-scale Bangla VQA dataset, ChitroJera, totaling over 15k samples from diverse and locally relevant data sources. We assess the performance of text encoders, image encoders, multimodal models, and our novel dual-encoder models. The experiments reveal that the pre-trained dual-encoders outperform other models of their scale. We also evaluate the performance of current large vision language models (LVLMs) using…
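As a rough illustration of the dual-encoder idea mentioned in the abstract, the sketch below fuses a question embedding and an image embedding and scores a fixed set of candidate answers. This excerpt does not describe the paper's actual architecture, so everything here is an assumption made for illustration: the fusion by concatenation, the single linear classifier, the function names `linear` and `dual_encoder_predict`, and the toy embeddings, weights, and answer vocabulary.

```python
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Compute weights @ x + bias for a single input vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def dual_encoder_predict(
    text_emb: List[float],
    image_emb: List[float],
    weights: List[List[float]],
    bias: List[float],
    answers: List[str],
) -> str:
    """Late fusion (assumed): concatenate both embeddings, then classify."""
    fused = text_emb + image_emb  # list concatenation, not element-wise addition
    logits = linear(fused, weights, bias)
    best = max(range(len(logits)), key=logits.__getitem__)
    return answers[best]


# Toy stand-ins for the outputs of a text encoder and an image encoder.
text_emb = [1.0, 0.0]
image_emb = [0.0, 1.0]

# Hypothetical answer vocabulary and classifier parameters (one row per answer).
answers = ["yes", "no"]
weights = [
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 1.0, 0.0],
]
bias = [0.0, 0.0]

prediction = dual_encoder_predict(text_emb, image_emb, weights, bias, answers)
print(prediction)
```

In a real system the two embeddings would come from pre-trained text and image encoders, and the classifier would be trained on question-answer pairs; the hand-set weights here only make the control flow visible.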

Code & Models

Datasets

pltops/chitroJera

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning