Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
Khang T. Doan, Bao G. Huynh, Dung T. Hoang, Thuc D. Pham, Nhat H. Pham, Quan T.M. Nguyen, Bang Q. Vo, Suong N. Hoang

TL;DR
Vintern-1B is a 1-billion-parameter multimodal language model optimized for Vietnamese, integrating visual and language models to perform OCR, document extraction, and question-answering with high accuracy and on-device suitability.
Contribution
We introduce Vintern-1B, a multimodal Vietnamese language model that combines a language component with a visual component and is fine-tuned on extensive datasets. We also open-source several Vietnamese VQA datasets.
Findings
Achieved robust performance on Vietnamese benchmarks like OpenViVQA and ViTextVQA.
Successfully integrated language and visual models for multimodal tasks.
Model is compact enough for on-device applications.
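As a rough illustration of the on-device claim, the sketch below estimates the memory needed to hold the weights alone at a few common precisions. It assumes a nominal count of exactly 1 billion parameters (taken from the model name, not an exact figure from the paper) and ignores activations, the KV cache, and runtime overhead. The function name `weight_memory_gib` is hypothetical.

```python
def weight_memory_gib(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB (no activations or cache)."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: nominal "1B" from the model name; the exact count is not stated here.
NUM_PARAMS = 1_000_000_000

for label, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>9}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Under these assumptions, half-precision weights come to a little under 2 GiB, which is the kind of budget that makes phone- or laptop-class deployment plausible.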
Abstract
In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs and achieves robust, reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at:…
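To make the vision-language pairing more concrete, the following sketch counts how many visual tokens a single 448x448 image tile would hand to the language model. Only the 448 px input size comes from the visual model's name. The patch size of 14 and the 0.5 spatial downsampling of patch tokens are assumptions based on how InternVL-style models are commonly configured, not details given in the abstract, and `visual_tokens_per_tile` is a hypothetical helper.

```python
def visual_tokens_per_tile(
    image_size: int = 448,    # from the visual model name "InternViT-300M-448px"
    patch_size: int = 14,     # assumption: ViT patch size, not stated in the abstract
    downsample: float = 0.5,  # assumption: per-side token reduction before the LLM
) -> int:
    """Number of visual tokens one square image tile contributes to the LLM input."""
    patches_per_side = image_size // patch_size
    merged_per_side = int(patches_per_side * downsample)
    return merged_per_side ** 2


print(visual_tokens_per_tile())                # 256 tokens with the assumed downsampling
print(visual_tokens_per_tile(downsample=1.0))  # 1024 tokens if no downsampling is applied
```

If these assumptions hold, each tile costs a few hundred tokens of context, which matters for a small language model handling dense documents.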
Code & Models
- 🤗 5CD-AI/Vintern-1B-v2 (model) · 651 dl · ♡ 81
- 🤗 5CD-AI/Vintern-1B-v3_5 (model) · 6.3k dl · ♡ 115
- 🤗 tt1225/Vintern-1B-v2-Custom (model) · 2 dl
- 🤗 5CD-AI/Vintern-3B-beta (model) · 377 dl · ♡ 37
- 🤗 YuukiAsuna/Vintern-1B-v2-ViTable-docvqa (model) · 7 dl · ♡ 2
- 🤗 5CD-AI/Vintern-3B-R-beta (model) · 11k dl · ♡ 21
- 🤗 Mungert/Vintern-1B-v3_5-GGUF (model) · 8 dl
- 🤗 innomation/Vintern-1B-v3_5_Eval (model) · 1 dl
- 🤗 enzosama/Vintern-1B-v3_5 (model)
Taxonomy
Topics: Natural Language Processing Techniques
