Latent Denoising Improves Visual Alignment in Large Multimodal Models

Dhruv Parikh; Jacob Fein-Ashley; Rajgopal Kannan; Viktor Prasanna

arXiv:2604.21343 · cs.CV · April 24, 2026

TL;DR

This paper introduces a latent denoising framework for large multimodal models: the model is trained to recover clean visual features from corrupted visual tokens, which improves its visual representations, its multimodal reasoning, and its robustness.

Contribution

The authors propose a novel latent denoising method that improves internal visual feature alignment and multimodal understanding in LMMs without increasing inference complexity.

Findings

01. Consistent improvement on visual understanding and reasoning benchmarks.

02. Enhanced robustness to common image corruptions.

03. Better compositional robustness on benchmarks such as NaturalBench.

Abstract

Large Multimodal Models (LMMs) such as LLaVA are typically trained with an autoregressive language modeling objective, providing only indirect supervision to visual tokens. This often yields weak internal visual representations and brittle behavior under distribution shift. Inspired by recent progress on latent denoising for learning high-quality visual tokenizers, we show that the same principle provides an effective form of visual supervision for improving internal visual feature alignment and multimodal understanding in LMMs. We propose a latent denoising framework that corrupts projected visual tokens using a saliency-aware mixture of masking and Gaussian noising. The LMM is trained to denoise these corrupted tokens by recovering clean teacher patch features from hidden states at a selected intermediate LLM layer using a decoder. To prevent representation collapse, our framework…
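
The corruption step and the denoising target described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not say how saliency shapes the mixture, what stands in for a masked token, or which loss is used, so the choices here are assumptions made only for illustration: tokens are picked for masking with probability proportional to their saliency, masked tokens are replaced by a zero vector, all other tokens receive additive Gaussian noise, and the loss is a mean cosine distance. The names `corrupt_visual_tokens`, `denoising_loss`, `mask_frac`, and `noise_std` are hypothetical.

```python
import numpy as np


def corrupt_visual_tokens(tokens, saliency, mask_frac=0.3, noise_std=0.5, seed=0):
    """Corrupt projected visual tokens with a mix of masking and Gaussian noise.

    tokens:   (N, D) array of projected visual tokens
    saliency: (N,) array of non-negative per-token saliency scores
    Returns (corrupted, is_masked), where is_masked is an (N,) boolean array.
    """
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]

    # Assumption: a token's chance of being masked is proportional to its saliency.
    probs = saliency + 1e-8
    probs = probs / probs.sum()
    n_mask = int(round(mask_frac * n))
    masked_idx = rng.choice(n, size=n_mask, replace=False, p=probs)
    is_masked = np.zeros(n, dtype=bool)
    is_masked[masked_idx] = True

    # Unmasked tokens receive additive Gaussian noise.
    corrupted = tokens + noise_std * rng.standard_normal(tokens.shape)
    # Assumption: a zero vector stands in for a (possibly learned) mask token.
    corrupted[is_masked] = 0.0
    return corrupted, is_masked


def denoising_loss(decoded, teacher):
    """Mean cosine distance between decoded features and clean teacher features.

    Assumption: the abstract does not name the loss; cosine distance is one
    common choice for feature-matching objectives.
    """
    d = decoded / np.linalg.norm(decoded, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(d * t, axis=-1)))
```

In a full training setup, the corrupted tokens would be fed to the LLM, a decoder would map the hidden states of the chosen intermediate layer to `decoded`, and `teacher` would hold the clean teacher patch features. Those components are outside the scope of this sketch.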

Code & Models

Repositories

dhruvashp/latent-denoising-for-lmms (GitHub)
