Second-Order Multi-Level Variance Correction for Modality Competition in Multimodal Models
Yishun Lu, Wes Armour

TL;DR
This paper introduces ML-FOP-SOAP, a second-order optimization method with variance correction that stabilizes multimodal training and improves efficiency in large-batch settings.
Contribution
It proposes a novel second-order optimizer with multi-level variance correction, enhancing stability and efficiency in multimodal model training.
Findings
Achieves stable training at batch size 8192.
Improves sample efficiency by up to 1.4x.
Speeds up training by up to 1.5x.
Abstract
Autoregressive next-token training offers a unified formulation for image generation and text understanding, but it also creates strong modality competition that destabilizes optimization and limits large-batch scaling. We show that first-order optimizers such as AdamW are vulnerable to cross-modality gradient heterogeneity, while second-order preconditioning, particularly SOAP, provides a more stable basis for multimodal alignment. Building on this insight, we propose ML-FOP-SOAP, a second-order optimization framework with Multi-Level Variance Correction. Our Fisher-Orthogonal Projection suppresses variance-induced modality conflicts, reducing the trade-off between visual generation and textual understanding. To make this practical under large gradient accumulation, we introduce a hierarchical folding strategy that captures fine-grained variance with low micro-step overhead.…
