Cascaded Self-Evaluation Augmented Training for Lightweight Multimodal LLMs

Zheqi Lv; Wenkai Wang; Jiawei Wang; Shengyu Zhang; Fei Wu

arXiv:2501.05662·cs.CL·March 18, 2025

TL;DR

This paper introduces Cas-SEAT, a training method for lightweight multimodal LLMs that strengthens self-evaluation and reasoning through cascaded prompts and data filtering, improving performance by 22.16% on average across datasets.

Contribution

The paper proposes Cas-SEAT, a cascaded self-evaluation training approach with double-level data filtering, to improve reasoning and self-evaluation in lightweight multimodal LLMs.

Findings

01. Cas-SEAT improves model performance by 22.16% on average across datasets.

02. Double-level data filtering (DDF) significantly reduces training resource consumption.

03. Enhanced self-evaluation capabilities lead to better reasoning accuracy.

Abstract

Efficient Multimodal Large Language Models (EMLLMs) can improve performance through Chain-of-Thought (CoT) reasoning, but they self-evaluate poorly during the CoT reasoning process. This is due to their tendency to simplify the reasoning process and to the degradation of self-evaluation ability during downstream task fine-tuning. To address this, we propose Self-Evaluation Augmented Training (SEAT), which uses more powerful EMLLMs to evaluate CoT reasoning data; the resulting evaluation data is then used to train EMLLMs. However, because EMLLMs struggle to process long token input-output sequences, and because of the degradation of the self-evaluation ability that CoT reasoning relies on, SEAT is not fully suited to them. We therefore further propose Cascaded Self-Evaluation Augmented Training (Cas-SEAT), which converts long prompts into…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Speech Recognition and Synthesis