Chitrarth: Bridging Vision and Language for a Billion People
Shaharukh Khan, Ayush Tarun, Abhinav Ravi, Ali Faraz, Akshat Patidar, Praveen Kumar Pokala, Anagha Bhangare, Raja Kolla, Chandra Khatri, Shubham Agarwal

TL;DR
Chitrarth is a multilingual vision-language model designed for 10 Indian languages, achieving state-of-the-art results on low-resource language benchmarks and fostering inclusive AI for diverse linguistic communities.
Contribution
The paper introduces Chitrarth, a novel multilingual vision-language model tailored for Indian languages, and BharatBench, a new evaluation framework for such models.
Findings
Achieves SOTA results on low-resource Indian languages
Maintains efficiency in English language tasks
Provides a comprehensive benchmark for Indian language VLMs
Abstract
Recent multimodal foundation models are primarily trained on English or high-resource European language data, which hinders their applicability to other medium- and low-resource languages. To address this limitation, we introduce Chitrarth (Chitra: Image; Artha: Meaning), an inclusive Vision-Language Model (VLM), specifically targeting the rich linguistic diversity and visual reasoning across 10 prominent Indian languages. Our model effectively integrates a state-of-the-art (SOTA) multilingual Large Language Model (LLM) with a vision module, primarily trained on multilingual image-text data. Furthermore, we also introduce BharatBench, a comprehensive framework for evaluating VLMs across various Indian languages, ultimately contributing to more diverse and effective AI systems. Our model achieves SOTA results for benchmarks across low-resource languages while retaining its efficiency in…
Taxonomy
Methods: Sparse Evolutionary Training
