Visual Question Answering Instruction: Unlocking Multimodal Large Language Model To Domain-Specific Visual Multitasks

Jusung Lee; Sungguk Cha; Younghyun Lee; Cheoljong Yang

arXiv:2402.08360 · cs.CV · February 14, 2024 · 2 cites

Open Access

TL;DR

This paper introduces a method to adapt multimodal large language models for domain-specific visual tasks by converting datasets into a question answering format, enabling effective multitask performance.

Contribution

The authors developed Visual Question Answering Instruction (VQA-IN), a unified question answering format that extends multimodal LLMs to domain-specific visual tasks, and report high scores on those tasks while maintaining vision-language performance.

Findings

01. High scores on domain-specific visual tasks

02. Maintains performance on vision-language multitasks

03. Effective adaptation of MLLMs for specialized domains

Abstract

Having revolutionized natural language processing (NLP) applications, large language models (LLMs) are expanding into the realm of multimodal inputs. Owing to their ability to interpret images, multimodal LLMs (MLLMs) have been primarily used for vision-language tasks. MLLMs have not yet been extended to domain-specific visual tasks, which require a more explicit understanding of visual information. We developed a method to transform domain-specific visual and vision-language datasets into a unified question answering format called Visual Question Answering Instruction (VQA-IN), thereby extending MLLMs to domain-specific tasks. VQA-IN was applied to train multiple MLLM architectures using smaller versions of LLMs (sLLMs). The experimental results indicated that the proposed method achieved high scores on domain-specific visual tasks while also maintaining its…
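
The general idea of casting differently labeled datasets into one question answering format can be sketched as below. This is only an illustration under stated assumptions, not the authors' actual VQA-IN templates or pipeline: the `VQAExample` record, the question wording, and the helper names `classification_to_vqa`, `caption_to_vqa`, and `build_multitask_set` are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class VQAExample:
    """One training sample in a unified (image, question, answer) form.

    Hypothetical record type; the paper's actual schema is not given here.
    """
    image_path: str
    question: str
    answer: str


def classification_to_vqa(
    image_path: str, label: str, class_names: Sequence[str]
) -> VQAExample:
    """Turn a classification label into a question/answer pair.

    The question wording is an assumed template, not taken from the paper.
    """
    options = ", ".join(class_names)
    question = f"Which category does the image belong to? Options: {options}."
    return VQAExample(image_path, question, label)


def caption_to_vqa(image_path: str, caption: str) -> VQAExample:
    """Turn a captioning sample into a question/answer pair (assumed template)."""
    return VQAExample(image_path, "Describe the image briefly.", caption)


def build_multitask_set(
    classification_records: Sequence[Tuple[str, str]],
    caption_records: Sequence[Tuple[str, str]],
    class_names: Sequence[str],
) -> List[VQAExample]:
    """Merge a domain-specific task and a vision-language task into one list.

    Once every sample shares the same form, a single model can be trained
    on the mixture without task-specific output heads.
    """
    examples = [
        classification_to_vqa(path, label, class_names)
        for path, label in classification_records
    ]
    examples += [caption_to_vqa(path, cap) for path, cap in caption_records]
    return examples
```

For instance, `build_multitask_set([("a.jpg", "cat")], [("b.jpg", "a dog on grass")], ["cat", "dog"])` yields two `VQAExample` records that differ only in their question and answer text, which is what lets one instruction-tuned model serve both tasks.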

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems