LLaVA-FA: Learning Fourier Approximation for Compressing Large Multimodal Models
Pengcheng Zheng, Chaoning Zhang, Jiarong Mo, GuoHui Li, Jiaquan Zhang, Jiahao Zhang, Sihan Cao, Sheng Zheng, Caiyan Qin, Guoqing Wang, Yang Yang

TL;DR
LLaVA-FA introduces a Fourier-based joint low-rank and quantization approach for compressing large multimodal models, significantly reducing computational costs while maintaining high performance.
Contribution
It proposes a novel Fourier domain approximation method combined with PolarQuant and diagonal calibration to improve compression of multimodal models.
Findings
Outperforms existing models on multiple benchmarks.
Achieves high compression with minimal accuracy loss.
Reduces computational and memory costs significantly.
Abstract
Large multimodal models (LMMs) have achieved impressive performance on various vision-language tasks, but their substantial computational and memory costs hinder their practical deployment. Existing compression methods often decouple low-rank decomposition and quantization, leading to compounded reconstruction errors, especially in multimodal architectures with cross-modal redundancy. To address this issue, we propose LLaVA-FA, a novel efficient LMM that performs joint low-rank plus quantization approximation in the frequency domain. By leveraging the de-correlation and conjugate symmetry properties of the Fourier transform, LLaVA-FA achieves more compact and accurate weight representations. Furthermore, we introduce PolarQuant, a polar-coordinate quantization method tailored for complex matrices, and an optional diagonal calibration (ODC) scheme that eliminates the need for large-scale…
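The two ideas named in the abstract, conjugate symmetry and polar-coordinate quantization of complex values, can be illustrated with a small sketch. This is not the authors' implementation: the helper names (dft, polar_quantize, polar_dequantize), the uniform rounding scheme, the 4-bit widths, and the toy weight row w are all assumptions made for illustration, and the low-rank step and ODC are not modeled. It shows only that the spectrum of a real vector is fully determined by roughly half of its coefficients, and that a complex coefficient can be stored as a quantized magnitude plus a quantized phase.

```python
import cmath
import math


def dft(x):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def polar_quantize(z, mag_max, mag_bits=4, phase_bits=4):
    """Round magnitude and phase of z to uniform grids (hypothetical scheme)."""
    mag_levels = 2 ** mag_bits - 1
    phase_levels = 2 ** phase_bits
    r, theta = cmath.polar(z)  # theta lies in [-pi, pi]
    r_code = round(min(r, mag_max) / mag_max * mag_levels)
    # Map the phase to [0, 2*pi) and wrap the top grid point back to 0.
    turn = (theta % (2 * math.pi)) / (2 * math.pi)
    theta_code = round(turn * phase_levels) % phase_levels
    return r_code, theta_code


def polar_dequantize(r_code, theta_code, mag_max, mag_bits=4, phase_bits=4):
    """Rebuild an approximate complex value from the two integer codes."""
    mag_levels = 2 ** mag_bits - 1
    phase_levels = 2 ** phase_bits
    r = r_code / mag_levels * mag_max
    theta = theta_code / phase_levels * 2 * math.pi
    return cmath.rect(r, theta)


# Toy real-valued "weight row" (made-up numbers, not taken from the paper).
w = [0.5, -1.0, 0.25, 2.0, -0.75, 1.5, 0.0, -0.5]
n = len(w)
spectrum = dft(w)

# Conjugate symmetry: for real input, spectrum[k] == conj(spectrum[n - k]),
# so coefficients 0 .. n // 2 are enough to recover the rest.
half = spectrum[: n // 2 + 1]

# Quantize the retained coefficients in polar form and measure the error.
mag_max = max(abs(z) for z in half)
codes = [polar_quantize(z, mag_max) for z in half]
restored = [polar_dequantize(rc, tc, mag_max) for rc, tc in codes]
max_err = max(abs(z - zh) for z, zh in zip(half, restored))
print(f"kept {len(half)} of {n} coefficients, max abs error {max_err:.4f}")
```

In this sketch the per-coefficient error is bounded by the magnitude step plus the largest magnitude times the phase step, so widening either bit width tightens the bound. How the paper actually allocates bits or calibrates scales is not stated in the text above.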
Peer Reviews
Decision: ICLR 2026 Poster
This paper proposes LLaVA-FA, a compression framework for large multimodal models (LMMs) that addresses the limitations of existing methods, where decoupled low-rank decomposition and quantization often lead to compounded reconstruction errors. LLaVA-FA employs Fourier approximation to integrate low-rank decomposition and quantization within the frequency domain, leveraging two essential properties of the Fourier transform: de-correlation, which reduces spectral redundancy, and conjugate symmetr
1. The paper proposes performing low-rank decomposition and quantization in the frequency domain via the Fourier transform, but the experimental section lacks comparisons with existing solutions in the spatial domain, limiting the credibility of its claimed competitiveness. 2. The method shows limited generalization capability, as experiments are only conducted on 3B and 7B-scale LLMs (Qwen-2.5) without evaluation on larger-parameter models.
1. The proposed LLaVA-FA is well motivated and supported by a clear theoretical framing. 2. The efficiency evidence rests on concrete measurements: latency, FLOPs, KV-cache usage, and TTFT are reported, which aligns well with the goals of a model compression study.
1. Limited ablation for ODC and calibration choices. The paper mentions that ODC removes the need for large calibration sets, but no direct comparison or ablation is presented to isolate this effect. Adding such results would make the claim more convincing. 2. There are some fairness concerns regarding the baseline comparisons in Table 1. The amount of training data varies across methods, and some baselines use fewer samples than LLaVA-FA while achieving comparable performance. 3. Discussion o
1. The paper is written to a very high standard; the figures analyze the problem and illustrate the idea very clearly, especially Figures 1 and 3. 2. The paper targets an important problem, efficient LMMs. 3. The idea is interesting and reasonable. I am happy to see that Fourier approximation can be applied to LMMs, since it has some good characteristics. 4. The experiments are abundant and clear, demonstrating the effectiveness of the proposed method.
1. One point for discussion: this paper chooses Fourier approximation; could other types of approximation be considered? I would be happy to see more comparison results. 2. The authors could discuss the limitations and future work in more detail.
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Speech and Audio Processing
