TL;DR
MergeMix introduces a novel augmentation paradigm that combines supervised fine-tuning and reinforcement learning for better alignment and generalization in multi-modal large language models, using an efficient Token Merge based Mixup technique.
Contribution
The paper proposes MergeMix, a unified framework that bridges SFT and RL with a Token Merge based Mixup, enhancing training efficiency, stability, and multi-modal alignment.
Findings
Achieves superior classification accuracy as an augmentation method.
Improves generalization abilities of MLLMs.
Enhances alignment and preference learning stability.
Abstract
Vision-language alignment in multi-modal large language models (MLLMs) relies on supervised fine-tuning (SFT) or reinforcement learning (RL). In the post-training stage, SFT is a stable choice but requires human annotations and generalizes poorly across tasks, while RL searches for better answers from reward signals but suffers from computational overhead and instability. To balance scalability, efficiency, and alignment generalization, we propose MergeMix, a unified paradigm that bridges SFT and RL with an efficient Token Merge based Mixup augmentation. For the Mixup policy, we generate contextually aligned mixed images with corresponding labels according to the merged attention maps with cluster regions. Then, we enhance the preference-driven paradigm for MLLMs by…
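The mask-driven mixing and label weighting described in the abstract can be illustrated with a generic attention-guided mixup. This is a minimal sketch under stated assumptions, not the authors' token-merge procedure: the function name, the plain Top-K patch selection, and the `keep_ratio` parameter are hypothetical, and the saliency map is assumed to be given.

```python
import numpy as np

def attention_mask_mixup(img_a, img_b, label_a, label_b, attn_a, patch, keep_ratio):
    """Mix img_a into img_b using the most salient patches of attn_a.

    img_a, img_b: float arrays of shape (H, W, C)
    label_a, label_b: one-hot arrays of shape (num_classes,)
    attn_a: patch-level saliency for img_a, shape (H // patch, W // patch)
    keep_ratio: fraction of patches taken from img_a (illustrative parameter)
    """
    gh, gw = attn_a.shape
    k = max(1, int(round(keep_ratio * gh * gw)))
    # Assumption: a plain Top-K over patch saliency stands in for the
    # paper's merged-attention cluster regions.
    top = np.argsort(attn_a.ravel())[-k:]
    patch_mask = np.zeros(gh * gw, dtype=img_a.dtype)
    patch_mask[top] = 1.0
    patch_mask = patch_mask.reshape(gh, gw)
    # Upsample the patch mask to pixel resolution and add a channel axis.
    pixel_mask = np.kron(patch_mask, np.ones((patch, patch), dtype=img_a.dtype))[..., None]
    mixed = pixel_mask * img_a + (1.0 - pixel_mask) * img_b
    # Re-scale lambda to the area actually taken from img_a.
    lam = float(pixel_mask.mean())
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed, mixed_label, lam
```

Tying the label weight to the realised mask area, rather than to a separately sampled ratio, keeps the mixed label consistent with what is visible in the mixed image.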
Peer Reviews
Decision·ICLR 2026 Poster
1. The introduction of a token merge–based mixing mechanism and attention recovery via bipartite soft matching is novel within the mixup literature. Compared to heuristic or random masking, the method provides a more structured way to preserve salient regions during interpolation. 2. The token-merge + mixup combination is reasonable, and the design (Top-K attention selection, λ re-scaling, ranking loss) can be integrated into existing ViT and MLLM frameworks with minimal modifications. The authors
1. The paper’s organization hinders readability. The introduction directly dives into technical detail without motivating the gap, and the related work section is largely enumerative rather than analytical. Notation is inconsistent and transitions are abrupt, making it difficult to follow the method’s rationale. 2. The paper mixes several technical ideas (token merging, mixup, λ re-scaling, and ranking loss) without a clear unifying formulation. It is unclear how the policy P(·,·) determines masks,
* A novel image mixing augmentation method is proposed, which demonstrates significant improvements across multiple datasets. * The mixed images are directly used as the "loser" in a pairwise ranking setup via SimPO, eliminating the cost and potential bias of training a separate Reward Model (RM) and simplifying the pipeline. * Extensive experiments are conducted, providing multi-faceted validation of the method's effectiveness.
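For readers unfamiliar with SimPO, the pairwise objective the reviewer refers to can be sketched as follows. The loss compares length-normalised log-probabilities of a preferred and a dispreferred response, with a target margin, and needs no separate reward model. How the paper obtains the two sets of log-probabilities from clean and mixed images is not stated in this excerpt, so the inputs here are simply assumed to be given, and the `beta` and `gamma` defaults are illustrative values only.

```python
import math

def simpo_loss(logp_win, logp_lose, beta=2.0, gamma=1.0):
    """SimPO-style pairwise loss from per-token log-probabilities.

    logp_win / logp_lose: lists of token log-probs for the preferred
    ("winner") and dispreferred ("loser") responses.
    beta, gamma: scaling and target margin (illustrative defaults).
    """
    # Length-normalised implicit rewards; no reward model is involved.
    r_win = beta * sum(logp_win) / len(logp_win)
    r_lose = beta * sum(logp_lose) / len(logp_lose)
    margin = r_win - r_lose - gamma
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the winner's average log-probability exceeds the loser's by more than `gamma / beta`, which is the behaviour a pairwise ranking setup relies on.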
* The application of MergeMix to image classification and MLLM alignment tasks shows some innovation, but the degree of novelty is limited. * The assumption that attention-based merged images are inherently of lower quality than original images lacks substantiating evidence. While the paper discusses the method from an MLLM perspective, validation is only conducted in the visual modality, leaving its efficacy in other modalities unexplored. * The performance drop on MMBench and MathVista afte
1. Quality: The experimental setup is rigorous, with thorough benchmarking and ablation studies. 2. Clarity: The methodology and results are clearly explained. 3. Significance: The approach demonstrates practical improvements in accuracy, calibration, and efficiency. 4. Applicability: MergeMix is shown to be effective across both image classification and multi-modal tasks.
1. Limited impact on multi-modal tasks: The gains for MLLMs are marginal, suggesting the method’s strengths are domain-specific. 2. Outdated baselines: Most compared methods in classification tasks are from two years ago, which may not represent the latest advances. 3. Scope of contribution: The paper could better clarify its impact boundaries. 4. Discussion of limitations: More explicit discussion of why the method is less effective for multi-modal tasks and why recent baselines were not includ
