Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data
Yucheng Shi, Quanzheng Li, Jin Sun, Xiang Li, Ninghao Liu

TL;DR
This paper introduces a self-synthesized data approach with a visual rejection sampling framework to enhance the cognition and explainability of large multimodal models, especially in fine-grained visual reasoning tasks.
Contribution
It proposes a novel iterative fine-tuning method using self-generated interpretable answers and a filtering mechanism to improve model accuracy and explainability.
Findings
Improved accuracy in visual classification tasks.
Enhanced explainability of model predictions.
Effective iterative synthetic data generation process.
Abstract
Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives and provide justifiable explanations for their predictions. To address the above challenge, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts, and carefully selected based on their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality…
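The filtering step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the truncated abstract does not specify the selection criterion, so the scoring rule below (fraction of image-aligned concepts mentioned, plus a check that the answer names the correct label) is an assumption, and the names `concept_coverage`, `select_best_answer`, and `min_coverage` are hypothetical. It only shows how candidate answers could be filtered without a learned reward model.

```python
from typing import Optional, Sequence


def concept_coverage(answer: str, concepts: Sequence[str]) -> float:
    """Fraction of the selected concepts that the answer mentions (case-insensitive)."""
    if not concepts:
        return 0.0
    text = answer.lower()
    hits = sum(1 for concept in concepts if concept.lower() in text)
    return hits / len(concepts)


def select_best_answer(
    candidates: Sequence[str],
    concepts: Sequence[str],
    label: str,
    min_coverage: float = 0.5,  # hypothetical threshold, not taken from the paper
) -> Optional[str]:
    """Return the candidate that names the label and covers the most concepts.

    Returns None when every candidate is rejected, so the sample can be
    skipped for this fine-tuning round.
    """
    # Reject answers that do not state the correct label at all.
    scored = [
        (concept_coverage(answer, concepts), answer)
        for answer in candidates
        if label.lower() in answer.lower()
    ]
    # Reject answers that mention too few of the image-aligned concepts.
    passing = [(score, answer) for score, answer in scored if score >= min_coverage]
    if not passing:
        return None
    # Keep the highest-scoring answer; ties resolve to the earliest candidate.
    return max(passing, key=lambda pair: pair[0])[1]


if __name__ == "__main__":
    concepts = ["red crest", "black face mask", "orange beak"]
    candidates = [
        "This is a cardinal because it has a red crest and a black face mask.",
        "This is a blue jay with a blue crest.",
        "This is a cardinal.",
    ]
    print(select_best_answer(candidates, concepts, label="cardinal"))
```

In an iterative setup, the accepted answers would be added to the fine-tuning set and the candidates regenerated by the updated model in the next round; the substring matching used here is a deliberately simple stand-in for whatever alignment measure the paper actually uses.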
Code & Models
- 🤗 YuchengShi/LLaVA-v1.5-7B-CUB-200 (model) · 7 dl · ♡ 3
- 🤗 YuchengShi/LLaVA-v1.5-7B-Fgvc (model) · 7 dl · ♡ 1
- 🤗 YuchengShi/LLaVA-v1.5-7B-HAM10000 (model) · 7 dl
- 🤗 YuchengShi/LLaVA-v1.5-7B-Plant-Leaf-Diseases-Detection (model) · 186 dl · ♡ 8
- 🤗 YuchengShi/LLaVA-v1.5-7B-Stanford-Dogs (model) · 1 dl
- 🤗 YuchengShi/llava-med-v1.5-mistral-7b-chest-xray (model) · 23 dl · ♡ 3
Taxonomy
Topics: Semantic Web and Ontologies
