mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle, Abhay Jain, Radu Timofte, Goran Glavaš

TL;DR
mBLIP introduces a computationally efficient method to adapt multilingual vision-language models using machine-translated data, achieving competitive results without extensive end-to-end multilingual pretraining.
Contribution
The paper presents the first multilingual Vision-LLM that is efficiently aligned with multilingual LLMs using limited data and without costly pretraining.
Findings
mBLIP performs competitively on IGLUE and XM3600 benchmarks.
It significantly outperforms English-only Vision-LLMs such as LLaVA 1.5.
The approach reduces computational costs for multilingual Vision-LLMs.
Abstract
Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to 'understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we re-align an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million…
Peer Reviews
Decision: Submitted to ICLR 2024
The approach is interesting in the sense that it covers a reasonable use case, that of multi-lingual LVLM by aligning frozen models. It seems like the authors found a good space where a solution did not exist in the literature. The proposed approach seems technically reasonable. The paper is mostly clear, the target application is clear and the explanations are usually clear (although some issues are flagged later on). Results are positive, although there isn't a lot to compare against.
The paper does not seem to have a strong technical contribution. On the pro side, I like the application, and there's some analysis of the impact of the different pieces in Table 3, which offsets somewhat the lack of technical contribution. I am a bit unsure about the impact of LoRA. A task prompt has much lower capacity to adapt to any downstream task compared to adding LoRA (especially if LoRA adapters are added to every 1x1 layer). So I'm wondering if part of the performance is due to LoRA v
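To make the reviewer's capacity remark concrete, the following is a minimal sketch of a generic LoRA-style low-rank update in plain Python. It is not taken from the paper: the function names, the `x @ W` layout, and the example dimensions (4096 x 4096, rank 8) are illustrative assumptions, not mBLIP's actual configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x, w, a, b, alpha):
    """Compute x @ W + (alpha / r) * (x @ A) @ B.

    x: (n, d_in) inputs
    w: (d_in, d_out) frozen pretrained weight
    a: (d_in, r) and b: (r, d_out) trainable low-rank factors
    """
    r = len(b)  # rank of the update
    scale = alpha / r
    base = matmul(x, w)
    update = matmul(matmul(x, a), b)
    return [
        [bv + scale * uv for bv, uv in zip(brow, urow)]
        for brow, urow in zip(base, update)
    ]


def lora_param_count(d_in, d_out, r):
    """Number of trainable parameters added by one rank-r adapter."""
    return r * (d_in + d_out)


# Hypothetical sizes, chosen only for illustration.
full = 4096 * 4096
added = lora_param_count(4096, 4096, 8)
print(f"adapter adds {added} trainable parameters vs. {full} frozen ones")
```

With the second factor initialized to zeros, the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained behavior. Every adapted layer gains its own trainable factors, which is why the reviewer expects adapters across many layers to offer more capacity than a prompt.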
mBLIP is created using only about 2.5 million images and by training 124 million parameters on consumer hardware, with high-quality English data machine-translated into 95 languages for training. The contribution of this paper is to present a cost-effective approach to developing multilingual vision-language models with broad language coverage and strong performance.
There are still large differences in accuracy between English and other languages on the XM3600 and xFlickrCO test sets. In addition, Table 2 is not easy to understand.
- The work proposes a dataset (if it is released) with image-text pairs in 96 different languages, which could be useful for future research.
- The experiments are decent, and the work covers some important ablation studies.
- The major concern is novelty. In short, the work can be summarized as (1) using machine translation to translate English image-text pairs into different languages, and (2) replacing the language decoder in BLIP-2 with a multilingual one and training the model on the translated dataset.
- The mixture of tasks highlighted in the paper as a contribution to the success of the model, although not considered in BLIP-2, was actually explored in PaLI. This fact further deteriorates the novelty of this w
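Step (1) of the reviewer's summary can be sketched roughly as follows. This is a hypothetical illustration, not the authors' pipeline: `assign_languages`, `translate_examples`, the uniform sampling over languages, and the `translate` callback (a stand-in for any machine translation system) are all assumptions.

```python
import random
from typing import Callable, Dict, List

Example = Dict[str, str]


def assign_languages(examples: List[Example], languages: List[str], seed: int = 0) -> List[Example]:
    """Tag each English example with a target language.

    Uniform sampling is an assumption; the paper may weight languages differently.
    """
    rng = random.Random(seed)
    return [{**ex, "lang": rng.choice(languages)} for ex in examples]


def translate_examples(examples: List[Example], translate: Callable[[str, str], str]) -> List[Example]:
    """Translate the text of each example into its assigned language.

    `translate(text, lang)` is a hypothetical hook for an MT system.
    Examples assigned to English are kept unchanged.
    """
    out = []
    for ex in examples:
        text = ex["text"] if ex["lang"] == "en" else translate(ex["text"], ex["lang"])
        out.append({**ex, "text": text})
    return out


if __name__ == "__main__":
    data = [{"image": f"img_{i}.jpg", "text": "a dog on a beach"} for i in range(5)]
    tagged = assign_languages(data, ["en", "de", "sw"], seed=0)

    def fake_mt(text: str, lang: str) -> str:
        # Placeholder translator that only marks the target language.
        return f"[{lang}] {text}"

    for ex in translate_examples(tagged, fake_mt):
        print(ex["image"], ex["lang"], ex["text"])
```

Each image keeps a single caption in one sampled language, so the number of training examples stays the same as in the English source data while coverage spreads across languages.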
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Natural Language Processing Techniques
Methods: Focus · ALIGN
