Enhancing Cognition and Explainability of Multimodal Foundation Models with Self-Synthesized Data

Yucheng Shi; Quanzheng Li; Jin Sun; Xiang Li; Ninghao Liu

arXiv:2502.14044·cs.CV·February 26, 2025

TL;DR

This paper introduces a self-synthesized data approach with a visual rejection sampling framework to enhance the cognition and explainability of large multimodal models, especially in fine-grained visual reasoning tasks.

Contribution

The paper proposes an iterative fine-tuning method that trains on self-generated interpretable answers and applies a filtering mechanism to improve model accuracy and explainability.

Findings

01 Improved accuracy in visual classification tasks.

02 Enhanced explainability of model predictions.

03 Effective iterative synthetic data generation process.

Abstract

Large Multimodal Models (LMMs), or Vision-Language Models (VLMs), have shown impressive capabilities in a wide range of visual tasks. However, they often struggle with fine-grained visual reasoning, failing to identify domain-specific objectives or to provide justifiable explanations for their predictions. To address this challenge, we propose a novel visual rejection sampling framework to improve the cognition and explainability of LMMs using self-synthesized data. Specifically, visual fine-tuning requires images, queries, and target answers. Our approach begins by synthesizing interpretable answers that include human-verifiable visual features. These features are based on expert-defined concepts and are carefully selected according to their alignment with the image content. After each round of fine-tuning, we apply a reward model-free filtering mechanism to select the highest-quality…
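The general shape of a reward model-free rejection sampling step can be illustrated with a small Python sketch. This is not the paper's implementation: the abstract is truncated before it states the actual filtering criterion, so the scoring rule here (the fraction of selected concepts that a candidate answer mentions) is an assumption made purely for illustration, and `concept_coverage`, `select_answer`, and `toy_generate` are hypothetical names. The toy generator stands in for a real multimodal model.

```python
from itertools import cycle
from typing import Callable, Optional, Sequence, Tuple


def concept_coverage(answer: str, concepts: Sequence[str]) -> float:
    """Fraction of the selected concepts that the answer mentions (assumed score)."""
    if not concepts:
        return 0.0
    text = answer.lower()
    hits = sum(1 for concept in concepts if concept.lower() in text)
    return hits / len(concepts)


def select_answer(
    generate: Callable[[str, str], str],
    image_id: str,
    query: str,
    label: str,
    concepts: Sequence[str],
    num_candidates: int = 4,
    min_score: float = 0.5,
) -> Optional[Tuple[str, float]]:
    """Sample several candidate answers and keep the best one, or reject them all."""
    best: Optional[Tuple[str, float]] = None
    for _ in range(num_candidates):
        answer = generate(image_id, query)
        # Reject candidates that do not name the target label at all.
        if label.lower() not in answer.lower():
            continue
        score = concept_coverage(answer, concepts)
        if best is None or score > best[1]:
            best = (answer, score)
    # Reject the whole batch if even the best candidate scores too low.
    if best is None or best[1] < min_score:
        return None
    return best


# Toy stand-in for a multimodal model: cycles through canned answers.
_canned = cycle([
    "This is a sparrow.",
    "This is a cardinal because it has a red crest and a black face mask.",
    "This is a cardinal with a red crest.",
])


def toy_generate(image_id: str, query: str) -> str:
    return next(_canned)


concepts = ["red crest", "black face mask", "short conical beak"]
result = select_answer(
    toy_generate, "img_001", "What bird is this?", "cardinal", concepts,
    num_candidates=3,
)
print(result)
```

In an iterative setup, answers that survive this kind of filter would be added to the fine-tuning set for the next round, and the loop would repeat with the updated model.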

Code & Models

Repositories

sycny/selfsynthx (PyTorch, official)

Taxonomy

Topics: Semantic Web and Ontologies