Enhancing Subtask Performance of Multi-modal Large Language Model
Yongqiang Zhao, Zhenyu Li, Feng Zhang, Xinhai Xu, Donghong Liu

TL;DR
This paper proposes a method to improve multi-modal large language models by selecting and combining results from multiple pre-trained models for each subtask, leading to enhanced overall performance.
Contribution
It introduces parallel subtask processing, in which multiple pre-trained models handle the same subtask and an LLM selects the best result, improving MLLM effectiveness.
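The flow described above (decompose, run several models on the same subtask in parallel, select one result, integrate) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: `run_subtask`, `solve`, `decompose`, `select`, `integrate`, and the toy models are all hypothetical names, and the selector here is a plain function standing in for the LLM-based selection the paper describes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

# Hypothetical types: a "model" maps a subtask input to a text result,
# and a "selector" returns the index of the candidate it prefers.
Model = Callable[[str], str]
Selector = Callable[[str, List[str]], int]


def run_subtask(subtask: str, models: List[Model], select: Selector) -> str:
    """Run every candidate model on the same subtask in parallel, keep one result."""
    # Assumes `models` is non-empty; ThreadPoolExecutor rejects max_workers=0.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        candidates = list(pool.map(lambda m: m(subtask), models))
    return candidates[select(subtask, candidates)]


def solve(
    task: str,
    decompose: Callable[[str], List[Tuple[str, str]]],
    model_pool: Dict[str, List[Model]],
    select: Selector,
    integrate: Callable[[str, List[str]], str],
) -> str:
    """Decompose the task, pick the best result per subtask, then integrate."""
    chosen = [
        run_subtask(payload, model_pool[kind], select)
        for kind, payload in decompose(task)
    ]
    return integrate(task, chosen)


# Toy stand-ins (assumptions for illustration only).
toy_models: List[Model] = [lambda s: s.upper(), lambda s: s + "!!"]


def pick_longest(subtask: str, candidates: List[str]) -> int:
    # Placeholder for an LLM judging which candidate is best.
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))


answer = solve(
    "cat",
    decompose=lambda t: [("caption", t)],
    model_pool={"caption": toy_models},
    select=pick_longest,
    integrate=lambda t, results: " | ".join(results),
)
```

In a real system the selector and integrator would each be an LLM call, and the model pool would hold actual pre-trained models for each subtask type; the sketch only shows where those pieces plug in.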
Findings
The proposed method outperforms baseline models on multiple datasets.
Using multiple models for the same subtask enhances accuracy.
Experimental results validate the effectiveness of the approach.
Abstract
A Multi-modal Large Language Model (MLLM) is a model extended from a Large Language Model (LLM) that can handle and reason over multi-modal data. Current MLLMs typically begin by using LLMs to decompose a task into multiple subtasks, then employ individual pre-trained models to complete specific subtasks, and finally use LLMs to integrate the results of each subtask into the result of the task. In real-world scenarios, when dealing with large projects, it is common practice to break the project down into smaller sub-projects, with different teams providing corresponding solutions or results. The project owner then decides which solution or result to use, ensuring the best possible outcome for each subtask and, consequently, for the entire project. Inspired by this, this study considers selecting multiple pre-trained models to complete the same…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Dropout · Adam · Position-Wise Feed-Forward Layer · Residual Connection · Label Smoothing · Softmax
