Frequency-Modulated Visual Restoration for Matryoshka Large Multimodal Models
Qingtao Pan, Zhihao Dou, Shuo Li

TL;DR
FMVR is a simple, frequency-based visual restoration method that enhances large multimodal models' reasoning by preserving and restoring visual semantics with fewer tokens, reducing computational costs significantly.
Contribution
Introducing FMVR, a novel frequency-modulated visual restoration technique that improves visual semantic preservation and model efficiency in large multimodal models.
Findings
Reduces FLOPs of LLaVA-1.5-7B by 89%.
Maintains nearly 100% of original accuracy.
Effective across multiple image and video benchmarks.
Abstract
Large Multimodal Models (LMMs) struggle to adapt to varying computational budgets because of their large number of visual tokens. Previous methods reduce the number of visual tokens before or within the LLM. However, these strategies inevitably lose visual semantics. To address this, we introduce FMVR, a plug-and-play and extremely simple Frequency-Modulated Visual Restoration strategy that boosts the reasoning ability of LMMs under visual token reduction. Specifically, FMVR disentangles the visual representation of the reduced visual tokens into low- and high-frequency components through AvgPool and MaxPool. The derived frequencies are then modulated using lightweight learnable parameters. The high-frequency component from AvgPool acts as a saliency filter to enhance salient visual semantics, while the low-frequency component from MaxPool acts as an anti-saliency filter to strengthen weak…
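The abstract is truncated here, so the exact formulation is not available on this page. As a rough illustration of the idea it describes (pooling the reduced tokens with AvgPool and MaxPool, then scaling each branch with a small set of learnable parameters), the following is a minimal sketch under stated assumptions. The function names `pool1d` and `fmvr_sketch`, the window size `k`, the stride-1 pooling over a 1-D token axis, the scalar weights `alpha` and `beta`, and the residual-style sum are all hypothetical choices made for illustration; they are not taken from the paper.

```python
# Illustrative sketch only. Window size, stride-1 pooling along a 1-D token
# axis, scalar weights, and the residual-style sum are ASSUMPTIONS, not the
# paper's confirmed design.

def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def pool1d(tokens, k, reduce_fn):
    """Stride-1 pooling over the token axis.

    tokens: list of n token vectors, each a list of d floats.
    Windows are clipped at the sequence edges, so the output keeps n tokens.
    """
    n, d = len(tokens), len(tokens[0])
    half = k // 2
    pooled = []
    for i in range(n):
        window = tokens[max(0, i - half):min(n, i + half + 1)]
        pooled.append([reduce_fn([t[c] for t in window]) for c in range(d)])
    return pooled


def fmvr_sketch(tokens, alpha, beta, k=3):
    """Add modulated AvgPool and MaxPool branches back onto the tokens.

    alpha and beta stand in for the lightweight learnable parameters; in a
    real model they would be trained rather than passed in by hand.
    """
    avg_branch = pool1d(tokens, k, mean)
    max_branch = pool1d(tokens, k, max)
    return [
        [x + alpha * a + beta * m for x, a, m in zip(tok, avg_tok, max_tok)]
        for tok, avg_tok, max_tok in zip(tokens, avg_branch, max_branch)
    ]


if __name__ == "__main__":
    reduced_tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
    print(fmvr_sketch(reduced_tokens, alpha=0.5, beta=0.5))
```

With both weights set to zero the sketch returns the input tokens unchanged, which matches the plug-and-play framing: the two branches only add a correction on top of the existing reduced tokens.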
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection
