Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models

Yishun Lu; Wes Armour

arXiv:2605.16165·cs.CV·May 18, 2026

TL;DR

This paper introduces ML-FOP-SOAP, a second-order optimization method with variance correction that stabilizes multimodal training and improves efficiency in large-batch settings.

Contribution

The paper proposes a second-order optimizer with multi-level variance correction that improves the stability and efficiency of multimodal model training.

Findings

1. Achieves stable training at batch size 8192.
2. Improves sample efficiency by up to 1.4x.
3. Speeds up training by up to 1.5x.

Abstract

Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead.…
