VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models

Jeongho Ju; Daeyoung Kim; SunYoung Park; Youngjune Kim

arXiv:2411.19103 · cs.CV · December 2, 2024

Open Access · 3 Models · 5 Datasets

TL;DR

VARCO-VISION is a new Korean-English vision-language model that effectively learns bilingual visual and linguistic information, demonstrating strong performance across diverse tasks and providing new datasets for evaluation.

Contribution

We introduce VARCO-VISION, an open-source bilingual VLM with a novel training strategy that preserves knowledge and expands capabilities, along with new Korean evaluation datasets.

Findings

01 Outperforms similar-sized models in bilingual image-text tasks
02 Capable of grounding, referring, and OCR functions
03 Provides new benchmarks for Korean vision-language understanding

Abstract

In this paper, we introduce an open-source Korean-English vision-language model (VLM), VARCO-VISION. We incorporate a step-by-step training strategy that allows a model to learn both linguistic and visual information while preserving the backbone model's knowledge. Compared to models of similar size, our model demonstrates outstanding performance in diverse settings requiring bilingual image-text understanding and generation abilities. VARCO-VISION is also capable of grounding, referring, and OCR, expanding its usage and potential applications for real-world scenarios. In addition to the model, we release five Korean evaluation datasets, including four closed-set benchmarks and one open-set benchmark. We anticipate that our milestone will broaden the opportunities for AI researchers aiming to train VLMs. VARCO-VISION is available at https://huggingface.co/NCSOFT/VARCO-VISION-14B.

Taxonomy

Topics: Media, Religion, Digital Communication · Educational Systems and Policies