EXaMCaP: Subset Selection with Entropy Gain Maximization for Probing Capability Gains of Large Chart Understanding Training Sets
Jiapeng Liu, Liang Li, Bing Li, Peng Fu, Xiyan Gao, Chengyang Fang, Xiaoshuai Hao, Can Ma

TL;DR
This paper introduces EXaMCaP, a subset selection method based on entropy gain maximization, to efficiently probe the capability gains of large chart understanding training sets for multimodal large language models.
Contribution
It proposes a subset selection approach that maximizes entropy gain, so that fine-tuning on the selected subset can probe the capability gains an MLLM would obtain from full-set fine-tuning, at a reduced fine-tuning cost.
Findings
EXaMCaP outperforms baseline methods in probing capability gains.
The method is effective across various subset sizes.
It is compatible with different MLLM architectures.
Abstract
Recent works focus on synthesizing Chart Understanding (ChartU) training sets to inject advanced chart knowledge into Multimodal Large Language Models (MLLMs), where the sufficiency of the knowledge is typically verified by quantifying capability gains via the fine-tune-then-evaluate paradigm. However, full-set fine-tuning MLLMs to assess such gains incurs significant time costs, hindering the iterative refinement cycles of the ChartU dataset. Reviewing the ChartU dataset synthesis and data selection domains, we find that subsets can potentially probe the MLLMs' capability gains from full-set fine-tuning. Given that data diversity is vital for boosting MLLMs' performance and entropy reflects this feature, we propose EXaMCaP, which uses entropy gain maximization to select a subset. To obtain a high-diversity subset, EXaMCaP chooses the maximum-entropy subset from the large ChartU…
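As a rough illustration of the idea of selecting a subset by entropy gain maximization, the sketch below greedily adds, at each step, the sample that most increases the Shannon entropy of the subset's label histogram. It is not the paper's algorithm: the abstract is truncated before the details, so the discrete per-sample labels (for example chart types or cluster ids), the greedy strategy, and the function names `shannon_entropy` and `greedy_entropy_subset` are all assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(counts, total):
    """Shannon entropy (in nats) of a label histogram with `total` items."""
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)


def greedy_entropy_subset(labels, k):
    """Hypothetical greedy selector: pick k indices, each time adding the
    sample whose label yields the largest entropy gain for the subset.

    `labels` is an assumed discrete tag per training sample; the source
    does not say how samples are represented.
    """
    remaining = set(range(len(labels)))
    counts = Counter()      # label histogram of the selected subset
    selected = []
    current = 0.0           # entropy of the current subset
    for _ in range(min(k, len(labels))):
        best_idx, best_gain = None, -math.inf
        for i in sorted(remaining):
            # Tentatively add sample i and measure the entropy gain.
            counts[labels[i]] += 1
            gain = shannon_entropy(counts, len(selected) + 1) - current
            counts[labels[i]] -= 1
            if gain > best_gain:  # strict '>' keeps the lowest index on ties
                best_idx, best_gain = i, gain
        counts[labels[best_idx]] += 1
        selected.append(best_idx)
        remaining.remove(best_idx)
        current = shannon_entropy(counts, len(selected))
    return selected


if __name__ == "__main__":
    chart_types = ["bar", "bar", "bar", "line", "pie", "bar"]
    print(greedy_entropy_subset(chart_types, 3))  # [0, 3, 4]
```

On this toy input the selector picks one sample of each chart type, which is the maximum-entropy subset of size 3. The inner loop rescans every remaining sample at each step, so a practical version for a large training set would need a cheaper gain computation.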
Taxonomy
Topics: Topic Modeling · Machine Learning and Data Classification · Machine Learning in Healthcare
